UNIT – I
Theory of Parallelism, Parallel computer models, The State of Computing, Multiprocessors and
Multicomputers, Multivector and SIMD Computers, PRAM and VLSI models, Architectural development
tracks, Program and network properties, Conditions of parallelism, Program partitioning and Scheduling,
Program flow Mechanisms, System interconnect Architectures.
Theory of Parallelism
Introduction
Parallelism means doing many things at the same time. Instead of finishing one task and
then starting another, the computer works on multiple instructions together. This saves time
and increases performance.
Modern processors (multi-core CPUs, GPUs, supercomputers) are based on this idea.
Types of Parallelism
1. Bit-Level Parallelism
o Processor handles more bits in one step.
o Example: a 64-bit processor can process a 64-bit operand in one step, while a 32-bit processor needs two steps.
2. Instruction-Level Parallelism (ILP)
o Many instructions run at the same time.
o Done using pipelining and superscalar execution.
3. Loop-Level Parallelism
o Loop tasks are divided among processors.
o Example: In matrix multiplication, each processor calculates part of the result.
4. Task-Level Parallelism
o Different functions run at the same time.
o Example: one thread handles input while another processes output (see the sketch after this list).
5. Job-Level Parallelism
o Two or more programs run together.
o Example: A user can listen to music while browsing the internet.
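As a rough illustration of task-level parallelism, the sketch below runs two independent functions in separate threads with Python's standard threading module; the function names are invented for the example, and because of CPython's global interpreter lock true CPU parallelism would normally use processes instead.

import threading

# two independent tasks that can run at the same time (task-level parallelism)
def handle_input():
    data = [x * x for x in range(5)]
    print("input prepared:", data)

def process_output():
    print("output processed")

t1 = threading.Thread(target=handle_input)
t2 = threading.Thread(target=process_output)
t1.start(); t2.start()   # both tasks are in flight together
t1.join(); t2.join()     # wait for both to finish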
Performance Laws
1. Amdahl’s Law
o States that the achievable speedup is limited by the serial (non-parallel) portion of the program:
S = 1 / ((1 − P) + P / N)
where P = parallel portion, N = number of processors.
2. Gustafson’s Law
o Argues that with increasing problem size, parallelism becomes more effective.
o Suggests scalability improves with larger workloads and more processors.
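A small sketch of both laws as Python functions, where P is the parallel fraction and N the number of processors; the Gustafson formula used here is the common scaled-speedup form S = (1 − P) + N·P.

def amdahl_speedup(p, n):
    # maximum speedup when a fraction p of the work is parallelizable
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    # scaled speedup when the problem size grows with the machine
    return (1.0 - p) + n * p

print(amdahl_speedup(0.9, 16))     # about 6.4: limited by the 10% serial part
print(gustafson_speedup(0.9, 16))  # 14.5: larger workloads scale better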
Applications
Weather forecasting, scientific simulations.
Machine learning and AI.
Image and video processing.
Robotics and real-time systems.
Advantages
Saves execution time.
Increases efficiency.
Handles large problems easily.
Disadvantages
Needs special hardware.
Programming is difficult.
Communication between tasks may slow down execution.
Applications of Flynn’s Classification
SISD: Personal computers, simple microcontrollers.
SIMD: Image/video processing, scientific simulations, AI training.
MISD: Rare, used in fault-tolerant systems.
MIMD: Most real-world systems like multicore CPUs, servers, clusters, cloud
computing.
Disadvantages of Flynn’s Classification
Some models (MISD) are impractical.
Complex design and programming.
Higher cost for hardware and maintenance.
The State of Computing
Introduction
The state of computing refers to the current trends, challenges, and progress in computer
systems.
Initially, computers were sequential uniprocessors, but as clock-frequency scaling hit the
power wall and Moore’s Law slowed, the industry moved towards parallelism. Today’s computing is
dominated by multi-core processors, GPUs, cloud computing, and AI accelerators.
Evolution of Computing
1. First Generation (1940s–50s) – Vacuum tubes, sequential execution.
2. Second Generation (1960s) – Transistors, faster uniprocessors.
3. Third Generation (1970s) – Integrated circuits, pipelining started.
4. Fourth Generation (1980s–90s) – Microprocessors, early multiprocessors.
5. Fifth Generation (2000s onwards) – Parallel and distributed systems, cloud
computing, GPUs, AI hardware.
Modern State of Computing
1. Multi-Core and Many-Core CPUs
o Processors now have multiple cores (quad-core, octa-core, 64+ cores in
servers).
o Each core executes tasks in parallel.
2. GPUs and Accelerators
o GPUs provide SIMD parallelism for graphics, AI, and scientific tasks.
o TPUs (Tensor Processing Units) and FPGAs are used for machine learning.
3. Cloud Computing
o Shared, on-demand resources available via the internet.
o Examples: AWS, Microsoft Azure, Google Cloud.
4. Edge and Mobile Computing
o Processing happens closer to data sources (IoT devices, mobile phones).
o Reduces latency in real-time applications.
5. Big Data and AI
o Massive datasets require parallel and distributed computing.
o AI/ML workloads dominate supercomputing and cloud systems.
Diagram: Evolution
Sequential → Pipelining → Superscalar → Multi-core → Many-core → Cloud/AI
Applications
Cloud services (Google Drive, AWS).
AI (Chatbots, Recommendation engines).
Real-time systems (autonomous vehicles, robotics).
Supercomputers for climate modeling, drug discovery.
Multiprocessors and Multicomputers
Multiprocessors (shared-memory systems)
Types of Multiprocessors:
1. Symmetric Multiprocessors (SMP):
o All processors are equal and share the same memory.
o Example: Modern multi-core CPUs.
2. Non-Uniform Memory Access (NUMA):
o Memory is divided among processors, but all of it remains globally accessible.
o Access time depends on whether the location is local or remote to the processor.
Advantages:
Easy to program (shared memory model).
High performance for medium-scale systems.
Disadvantages:
Memory contention (bottleneck if many processors access memory).
Limited scalability (not good for thousands of processors).
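A minimal sketch of the shared-memory style used on SMP machines: several threads update a single shared counter through a lock, which shows both why shared memory is easy to program and where memory contention comes from; the names are illustrative.

import threading

counter = 0                  # shared variable (single global address space)
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # serialized access: the contention point
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)               # 40000, but the shared lock limits scalability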
Multicomputers (message-passing systems)
Each processor (node) has its own private memory, and nodes communicate by passing messages over a network.
Examples:
Computer clusters, distributed supercomputers.
MPI (Message Passing Interface) is commonly used (see the sketch after this subsection).
Advantages:
Highly scalable (can connect thousands of processors).
No memory contention, since each processor has private memory.
Disadvantages:
Difficult to program (explicit message passing).
Higher communication overhead.
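A minimal message-passing sketch in the multicomputer style, assuming the third-party mpi4py package is installed and the script is launched with something like mpirun -n 2 python demo.py (the file name and message contents are illustrative).

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # each process (node) has its own private memory

if rank == 0:
    data = {"task": "sort", "payload": [3, 1, 2]}
    comm.send(data, dest=1, tag=0)      # explicit message passing
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received:", data)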
Applications
Multiprocessors: General-purpose servers, desktops, multi-core processors in
laptops.
Multicomputers: Supercomputers, large-scale simulations, scientific computing,
cloud systems.
Differences between Multiprocessors and Multicomputers
Address Space: multiprocessors provide a single global address space (all processors see the same memory); multicomputers have multiple private address spaces (each processor has its own).
Cost: multiprocessors are more expensive due to shared-memory hardware; multicomputers are cheaper and easier to build from networked PCs.
Speed: multiprocessors are faster for small to medium systems; multicomputers are faster for large-scale systems.
Multivector and SIMD Computers
Introduction
Parallel computing systems aim to process large volumes of data efficiently by executing
multiple operations at once. Two important categories are SIMD (Single Instruction
Multiple Data) computers and Multivector computers. Both are widely used in scientific,
engineering, multimedia, and AI applications, but they differ in design and working.
1. SIMD Computers
Definition: A SIMD computer has one control unit that broadcasts the same
instruction to many processing elements (PEs), but each PE works on a different
data item at the same time.
This is a form of data parallelism because the same operation is applied to many data
elements simultaneously.
Characteristics:
1. Single instruction → multiple data streams.
2. All processors execute in lockstep (synchronized).
3. Very efficient for applications with regular data structures like arrays and matrices.
Examples:
GPUs (Graphics Processing Units).
Array processors like ILLIAC IV, MasPar.
Supercomputers using SIMD for scientific simulations.
Applications:
Image/video processing.
Signal processing.
Matrix multiplication.
Weather forecasting.
Advantages:
Simple control mechanism.
High performance for vectorizable tasks.
Saves power and reduces instruction overhead.
Limitations:
Not good for irregular tasks or tasks with different control flows.
Idle processors if data size is uneven.
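A toy sketch of the SIMD idea: one “instruction” (a Python function standing in for a hardware operation) is broadcast to several processing elements, each holding a different data item, and all results are produced in the same lockstep step.

def instruction(x):
    return x * 2 + 1              # the same operation for every PE

pe_data = [10, 20, 30, 40]        # one data item per processing element
results = [instruction(d) for d in pe_data]   # applied to all PEs "at once"
print(results)                    # [21, 41, 61, 81]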
2. Multivector Computers
Definition: Multivector computers use vector processors that can execute operations
on entire arrays of data (vectors) in a single instruction.
They are more powerful than SIMD because they can handle multiple vector
operations and support vector registers.
Characteristics:
1. Operates on long vectors of data (e.g., arrays of 1000 numbers).
2. Special vector instructions (e.g., vector add, vector multiply).
3. Reduces instruction fetch/decode overhead by applying one instruction to many
elements.
Examples:
CRAY vector supercomputers (CRAY-1, CRAY-XMP).
NEC SX series.
Modern CPUs with vector extensions (Intel AVX, ARM NEON).
Applications:
Scientific simulations (physics, astronomy).
Engineering computations.
Machine learning & AI acceleration.
Linear algebra and matrix-based problems.
Advantages:
Faster execution for vectorizable problems.
Reduces instruction overhead.
Excellent for scientific workloads.
Limitations:
Requires vector-friendly programs.
More complex and costly hardware.
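A small sketch contrasting an element-by-element loop with a single whole-array operation, using NumPy as a stand-in for vector hardware; the library dispatches the array operation in compiled code, which is the spirit of a vector instruction (on many CPUs it ultimately uses extensions such as AVX or NEON).

import numpy as np

a = np.arange(1000, dtype=np.float64)
b = np.arange(1000, dtype=np.float64)

# scalar style: one addition issued per element
c_scalar = [a[i] + b[i] for i in range(len(a))]

# vector style: a single "vector add" covering all 1000 elements
c_vector = a + b
assert np.allclose(c_scalar, c_vector)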
Differences between SIMD and Multivector Computers
Data Handling: SIMD works on individual data items spread across many processors; a multivector machine works on entire vectors (arrays) in one go.
Control: in SIMD, a central control unit broadcasts instructions; in a multivector machine, the vector unit fetches and executes vector instructions.
Flexibility: SIMD is best for simple data-parallel tasks; multivector machines are best for complex scientific computations.
Execution Style: SIMD uses lockstep execution, where all processors follow the same instruction; a multivector machine can perform multiple vector instructions simultaneously.
Diagram
SIMD: [Instruction] → P1(Data1), P2(Data2), P3(Data3), P4(Data4)
Conclusion
SIMD computers are ideal for simple, data-parallel applications such as graphics,
image processing, and matrix operations.
Multivector computers are more powerful, designed for scientific and engineering
applications where large vector operations dominate.
Together, they form the foundation of high-performance computing (HPC) and
modern processors (CPU + GPU hybrid systems).
PRAM (Parallel Random Access Machine) Model
Introduction
PRAM is an idealized shared-memory model of parallel computation used for designing and analyzing parallel algorithms.
Architecture of PRAM
1. Processors (P1, P2, … Pn): A large number of simple processors.
2. Shared Memory: A single global memory accessible by all processors.
3. Control: Synchronous execution (all processors run in lockstep).
4. Uniform Access: Each processor can read/write from any memory cell in unit time.
Advantages of PRAM
Simplifies the analysis of parallel algorithms.
Provides a clean mathematical model.
Helps estimate speedup and efficiency of parallel programs.
Limitations of PRAM
Unrealistic because actual hardware cannot support unlimited processors and
constant-time memory access.
Ignores communication delays and memory contention.
Used only as a theoretical tool, not a real implementation.
Applications
Designing and analyzing algorithms for sorting, searching, matrix multiplication, and
graph problems.
Teaching parallel computing concepts.
Diagram
P1 P2 P3 P4 ... Pn
\ | /
[Shared Global Memory]
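A minimal sketch of a PRAM-style parallel sum, simulated sequentially: in an ideal PRAM all pairs are added in the same synchronous step, so n values are reduced in about log2(n) steps.

def pram_sum(values):
    data = list(values)
    steps = 0
    while len(data) > 1:
        # one synchronous PRAM step: every pair is combined "in parallel"
        data = [data[i] + data[i + 1] if i + 1 < len(data) else data[i]
                for i in range(0, len(data), 2)]
        steps += 1
    return data[0], steps

total, steps = pram_sum(range(16))
print(total, steps)   # 120 4 -> 16 values summed in 4 parallel steps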
VLSI (Very Large Scale Integration) Model
Introduction
VLSI (Very Large Scale Integration) is the process of integrating thousands to
millions of transistors onto a single chip.
In parallel computing, the VLSI model studies how efficiently an algorithm can be
mapped to hardware by considering chip area and execution time.
Key Points
1. Processing Elements (PEs): Many small processors are embedded on a single chip.
2. Interconnection Network: Communication between PEs is done via on-chip
networks.
3. Area–Time Complexity:
o A = Area of chip (proportional to number of processors + wires).
o T = Time taken to execute the algorithm.
o Efficiency measured using AT² (area–time squared).
4. Goal: Design hardware that minimizes chip area and computation time.
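A toy comparison of two hypothetical designs under the AT² measure; the area and time figures below are invented purely for illustration.

designs = {
    "A": {"area": 4.0, "time": 10.0},   # smaller chip, slower algorithm
    "B": {"area": 8.0, "time": 6.0},    # larger chip, faster algorithm
}
for name, d in designs.items():
    at2 = d["area"] * d["time"] ** 2
    print(name, at2)    # A -> 400.0, B -> 288.0: B wins under AT^2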
Advantages
High speed and performance due to massive parallelism.
Compact and low-cost compared to separate processors.
Suitable for special-purpose architectures like systolic arrays, GPUs, and AI
accelerators.
Limitations
Physical constraints: chip size, power, and heat dissipation.
Complex and costly design process.
Limited flexibility (hardware is fixed once fabricated).
Applications
Design of CPUs and GPUs.
AI accelerators (e.g., Google TPU, NVIDIA Tensor Cores).
Digital Signal Processing (DSP) and image processing.
Network-on-Chip in multicore processors.
Diagram
+----------------------------------+
| Multiple Processing Elements |
| +---+ +---+ +---+ +---+ |
| |PE1| |PE2| |PE3| |PEn| |
| +---+ +---+ +---+ +---+ |
| Interconnection Network |
+----------------------------------+
Differences between PRAM and VLSI Models
Practicality: the PRAM model is purely theoretical and idealized (not practical), while the VLSI model is realistic and used in physical chip design.
Conclusion
PRAM model is best for theoretical study of parallel algorithms.
VLSI model is best for practical implementation of algorithms in hardware.
Together, they bridge the gap between algorithm theory and hardware design.
Architectural Development Tracks
Introduction
The evolution of computer architecture follows different development tracks
based on technological improvements and performance demands.
These tracks describe the historical progression from sequential computing to
parallel and distributed computing.
Program and Network Properties
Introduction
In parallel computing, program behavior and network organization determine how
efficiently tasks can be executed.
Program properties describe how a program can be divided into parallel parts.
Network properties describe how processors communicate in a parallel system.
1. Program Properties
1. Parallelism
o Amount of work that can be done simultaneously.
o Measured as the degree of parallelism = (total operations) / (time steps).
2. Granularity
o Coarse-grain: Large tasks, fewer communication needs.
o Fine-grain: Small tasks, frequent communication.
3. Data Dependence
o Determines whether instructions can be executed in parallel.
o Types: true (flow) dependence, anti-dependence, and output dependence (illustrated in the sketch after this list).
4. Control Dependence
o Arises from conditional branches (e.g., if–else).
o Reduces parallel execution possibilities.
5. Computational Load Balance
o Work must be evenly distributed among processors to avoid idle time.
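A minimal sketch of the three data-dependence types between statements, using illustrative variable names; the comments note which pairs block parallel execution.

b, c = 2, 3
a = b + c      # S1
d = a * 2      # S2: true (flow) dependence on S1 -- S2 reads the a written by S1
b = 5          # S3: anti-dependence on S1 -- S3 overwrites b, which S1 reads
a = d - 1      # S4: output dependence on S1 -- both statements write a
# S1 and S2 must stay ordered; the anti- and output dependences can often be
# removed by renaming variables, which recovers extra parallelism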
2. Network Properties
1. Topology
o Structure of processor interconnection (bus, mesh, hypercube, tree, etc.).
2. Diameter
o The longest shortest path (measured in links) between any two processors; a smaller
diameter means faster worst-case communication.
3. Connectivity
o Number of alternative paths between processors → improves fault tolerance.
4. Bisection Width/Bandwidth
o Minimum number of links that must be cut to divide the network into two
equal halves.
o Higher = better data transfer capacity.
5. Latency & Bandwidth
o Latency: Time to deliver a message.
o Bandwidth: Maximum data transfer rate.
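A small sketch that evaluates diameter and bisection width for two common topologies using their standard closed-form values (assumed here rather than derived): a k x k 2-D mesh and a d-dimensional hypercube.

def mesh_metrics(k):
    # k x k mesh: corner-to-corner distance and links cut down the middle
    diameter = 2 * (k - 1)
    bisection_width = k
    return diameter, bisection_width

def hypercube_metrics(d):
    # d-dimensional hypercube with 2**d nodes
    diameter = d                    # equals log2(number of nodes)
    bisection_width = 2 ** (d - 1)  # half of the nodes each lose one link
    return diameter, bisection_width

print(mesh_metrics(4))        # (6, 4) for 16 nodes
print(hypercube_metrics(4))   # (4, 8) for 16 nodes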
Diagram (Conceptual)
Program Properties → Parallelism, Granularity, Dependence, Balance
Network Properties → Topology, Diameter, Bandwidth, Latency
Conditions of Parallelism
Introduction
Parallelism means executing multiple operations simultaneously.
For a program to be executed in parallel, certain conditions must be satisfied so that
tasks can run independently without conflicts.
Diagram (Conceptual)
Conditions of Parallelism:
├── Data Dependence
├── Control Dependence
├── Resource Dependence
├── Granularity
├── Load Balance
└── Communication/Latency
Conclusion
Parallelism is only possible when dependencies are minimized, resources are
available, and tasks are balanced.
These conditions ensure that programs run efficiently in parallel without idle
processors or conflicts.
Program Partitioning and Scheduling
1. Program Partitioning
Partitioning = dividing a program into smaller units (tasks or data blocks).
Functional Partitioning
o Different processors perform different functions.
o Example: One processor handles input, another does computation, another
does output.
Data Partitioning
o Input data is split into smaller blocks and distributed among processors.
o Example: splitting an array into chunks for parallel sorting (see the sketch after this list).
Granularity in Partitioning
o Fine-grain: Small tasks, more communication overhead.
o Coarse-grain: Larger tasks, less communication, better efficiency.
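A minimal sketch of data partitioning for parallel sorting with the standard multiprocessing module: the array is split into chunks, each worker sorts one chunk, and the sorted chunks are merged; the chunk sizes and data are illustrative.

from multiprocessing import Pool
from heapq import merge

def sort_chunk(chunk):
    return sorted(chunk)

if __name__ == "__main__":
    data = [5, 2, 9, 1, 7, 3, 8, 6, 4, 0] * 1000
    workers = 4
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]  # data partitioning
    with Pool(workers) as pool:
        sorted_chunks = pool.map(sort_chunk, chunks)   # each processor sorts its block
    result = list(merge(*sorted_chunks))               # combine the partial results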
2. Program Scheduling
Scheduling = deciding the order and allocation of tasks to processors.
Static Scheduling
o Tasks are assigned to processors before execution.
o Works well when tasks are predictable.
o Example: Loop iterations divided equally among processors.
Dynamic Scheduling
o Tasks are assigned at runtime depending on resource availability.
o Handles irregular or unpredictable workloads better.
Load Balancing
o Work must be distributed evenly among processors.
o Prevents idle processors and bottlenecks.
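A small sketch contrasting static and dynamic scheduling with multiprocessing.Pool: the chunksize argument pre-assigns large blocks (static style), while imap_unordered lets idle workers pull one task at a time (dynamic style); the task durations are deliberately uneven and purely illustrative.

from multiprocessing import Pool
import random, time

def task(n):
    time.sleep(n * 0.001)    # simulate work of varying size
    return n

if __name__ == "__main__":
    jobs = [random.randint(1, 50) for _ in range(200)]
    with Pool(4) as pool:
        # static style: the job list is split into 4 big blocks up front
        static = pool.map(task, jobs, chunksize=len(jobs) // 4)
        # dynamic style: workers fetch one job at a time as they finish
        dynamic = list(pool.imap_unordered(task, jobs, chunksize=1))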
Diagram (Conceptual)
Program Partitioning → Functional / Data
↓
Program Scheduling → Static / Dynamic
↓
Balanced Parallel Execution
Conclusion
Partitioning divides the program into manageable parallel tasks.
Scheduling arranges those tasks efficiently on processors.
Together, they ensure high performance, efficiency, and scalability in parallel
computing.
Program Flow Mechanisms
Introduction
In computer architecture, program flow mechanisms define how instructions are
executed and how the control/data flows in the system.
Different mechanisms support different types of parallelism.
Diagram (Conceptual)
Program Flow Mechanisms:
├── Control Flow → Sequential execution
├── Data Flow → Executes when data ready
├── Demand Driven → Executes when needed
└── Reduction → Expression reduction
Conclusion
Control flow is best for sequential tasks.
Data flow and reduction models exploit parallelism.
Demand-driven flow avoids redundant computation.
Together, these mechanisms provide different ways to achieve efficient execution in
modern architectures.
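A minimal sketch of the demand-driven (lazy) idea using a Python generator: a value is computed only when something downstream actually asks for it, so redundant computation is avoided.

def expensive(x):
    print("computing", x)    # visible marker showing when work actually happens
    return x * x

lazy = (expensive(x) for x in range(1_000_000))   # nothing computed yet

first_three = [next(lazy) for _ in range(3)]      # only three computations occur
print(first_three)           # [0, 1, 4]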
System Interconnect Architectures
Diagram (Conceptual)
System Interconnects:
├── Bus
├── Crossbar
├── Multistage (Omega, Banyan)
├── Mesh / Torus
├── Hypercube
└── Tree / Fat-Tree
Conclusion
Bus is simple but limited.
Crossbar is powerful but costly.
Multistage, Mesh, Hypercube, and Tree networks provide scalable and efficient
interconnections.
Choice of interconnect depends on system size, cost, and performance
requirements.