U1 & U2 Padcom-25
PARALLEL COMPUTING
Scope of Parallel Computing
• The scope of parallel computing spans a wide array of fields, driving advances by enabling faster, more efficient, and scalable solutions to complex problems. Below are key domains where parallel computing plays a transformative role:
1. Scientific and Engineering Simulations
• It enables faster simulations for complex physical and engineering
problems.
2. Big Data and Analytics
• Parallel computing processes and analyzes large datasets efficiently.
3. High-Performance Computing (HPC)
• It powers supercomputers to solve advanced scientific challenges.
4. Artificial Intelligence (AI) and Machine Learning (ML)
• Accelerates AI model training and inference for large-scale applications.
5. Cloud Computing and Distributed Systems
• Enhances scalability and speed in cloud and distributed environments.
6. Quantum Computing
• Leverages parallelism at the quantum level for solving specific problems.
7. Graphics and Gaming
• Enables real-time rendering and physics simulations in games and AR/VR.
8. Computational Biology
• Processes genetic and biological data for research and drug discovery.
Issues of Parallel Computing
• The issues in parallel computing arise from the challenges of
coordinating multiple processes, optimizing performance, and ensuring
scalability. Below are key issues faced in parallel computing:
1. Synchronization Overhead
• Managing synchronization among parallel tasks adds latency and
complexity.
2. Load Balancing
• Uneven workload distribution across processors reduces efficiency.
3. Communication Delays
• Data exchange between processes or nodes can cause performance bottlenecks.
4. Scalability
• Scaling applications to larger systems often leads to diminishing returns.
5. Fault Tolerance
• Detecting and recovering from failures in a parallel environment is challenging.
6. Data Dependency
• Dependencies between tasks can limit parallel execution and slow down processes.
7. Debugging and Testing
• Identifying and resolving issues in parallel programs is more complex than in sequential ones.
8. Hardware Limitations
• Performance is constrained by memory bandwidth,
processor speed, and interconnects.
Goals of Parallel Computing
1. Speedup
• Reduce execution time for complex computations.
2. Scalability
• Enable systems to handle increasing workloads effectively.
3. Resource Utilization
• Maximize the use of available processing resources.
4. Problem-Solving Capability
• Tackle problems that are infeasible for sequential computing.
5. Energy Efficiency
• Minimize power consumption while maintaining high performance.
6. Real-Time Processing
• Achieve low-latency performance for time-sensitive applications.
7. Fault Tolerance
• Ensure reliability and recovery in case of failures.
8. Cost-Effectiveness
• Optimize computing costs by leveraging parallel architectures.
9. High Throughput
• Increase the volume of tasks processed in a given time.
Limitations of Parallel Computing
1. Synchronization Overhead
Coordinating multiple processes introduces delays and complexity.
2. Communication Delays
Data transfer between processors or nodes can slow down overall
performance.
3. Load Balancing Issues
Uneven distribution of work among processors reduces efficiency.
4. Fault Tolerance Challenges
Ensuring reliable operation and recovery from failures is difficult.
5. Programming Complexity
Writing and debugging parallel programs requires advanced expertise.
6. High Cost
The cost of specialized parallel hardware and software infrastructure is often
prohibitive.
7. Energy Consumption
Parallel systems consume significant power, increasing operational costs.
8. Hardware Limitations
Processor speed, memory bandwidth, and interconnect performance can act as
bottlenecks.
Challenges in Parallel Computing
1. Synchronization Overhead
• Coordinating multiple processes or threads requires synchronization, which
adds complexity and delays.
2. Communication Overhead
• Frequent data exchange between processors can slow down performance
due to network latency or bandwidth limitations.
3. Load Balancing
• Uneven workload distribution across processors leads to inefficiencies and
reduced performance.
4. Scalability
Performance gains diminish as the number of processors increases,
especially for non-parallelizable tasks.
5. Fault Tolerance
Detecting and recovering from hardware or software failures in parallel
systems is challenging.
6. Programming Complexity
Developing efficient parallel algorithms and debugging them requires
specialized knowledge and tools.
7. Data Dependency
Inter-task dependencies can limit the level of achievable parallelism,
slowing execution.
8. Energy Efficiency
High energy consumption in large-scale parallel systems affects
sustainability and operational costs.
Relationship between Parallelism and Concurrency
• Parallelism refers to performing multiple tasks simultaneously, where
tasks are actually executed at the same time, typically using multiple
processors or cores. It is about achieving faster execution by dividing a
task into smaller sub-tasks and executing them concurrently across
multiple hardware resources.
• Concurrency refers to the concept of managing multiple tasks at the
same time, but not necessarily simultaneously. It involves the ability of a
system to handle multiple tasks in an overlapping time frame by
switching between them. Concurrency can be achieved even with a
single processor by quickly switching between tasks (time-sharing).
Key Differences and Relationship
• Parallelism is a subset of concurrency: Parallelism requires multiple
processors or cores to execute tasks simultaneously, while concurrency
can occur with a single processor by interleaving tasks.
• Concurrency enables efficient management of multiple tasks, whereas
parallelism accelerates processing by dividing tasks and running them at
the same time.
• A system can be concurrent without being parallel, but it cannot be
parallel without being concurrent.
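A minimal C sketch of the relationship, using POSIX threads (compile with -pthread): the two tasks below are concurrent by construction, and they run in parallel only if the hardware provides multiple cores; on a single core the OS interleaves them by time-sharing.

```c
#include <stdio.h>
#include <pthread.h>

/* Each thread is a separately managed task (concurrency). On a multi-core
   machine the OS can run both at the same instant (parallelism); on a
   single core it interleaves them by time-sharing. */
static void *task(void *name) {
    for (int i = 0; i < 3; i++)
        printf("%s: step %d\n", (const char *)name, i);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, task, "task A");
    pthread_create(&b, NULL, task, "task B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

The interleaving of the two outputs is not fixed: it depends on how the scheduler overlaps the tasks, which is exactly the concurrent behavior described above.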
Parallelism with Multiple Instruction Streams
Multiple Instruction Streams
Multiple instruction streams refer to the concept of executing multiple sequences of
instructions simultaneously. This approach is fundamental to parallel computing and is used to
enhance the performance and efficiency of computer systems. Here are some key concepts
and techniques related to multiple instruction streams:
1. Multiple Instruction, Multiple Data (MIMD)
MIMD is a classification of parallel computer architecture where multiple processors execute
different instructions on different data. This approach allows for high levels of parallelism
and is common in many modern supercomputers and multiprocessor systems.
2. Single Instruction, Multiple Data (SIMD)
In SIMD, a single instruction is executed on multiple data points simultaneously. This is often used in
vector processors and graphics processing units (GPUs) where the same operation is applied to
large sets of data, such as in image processing or scientific computing.
3. Multithreading
Multithreading is a technique where multiple threads (lightweight processes) are created within a
single process to execute different instruction streams. This allows for concurrent execution and
better utilization of CPU resources. There are several types of multithreading:
• Coarse-grained multithreading: Switches threads only on long-latency events, such as
cache misses.
• Fine-grained multithreading: Switches threads at each instruction cycle.
• Simultaneous multithreading (SMT): Allows multiple threads to issue instructions
to a superscalar processor's multiple functional units simultaneously. Hyper-threading in
Intel processors is an example of SMT.
4. Multiprocessing
Multiprocessing involves using two or more CPUs within a single computer system to
execute multiple instruction streams concurrently. This can be implemented in various
ways:
• Symmetric multiprocessing (SMP): All processors share a single, main memory and are capable of
running any process.
• Asymmetric multiprocessing (AMP): Each processor is assigned specific tasks, and one processor
controls the system.
5. Parallel Programming Models
These models provide frameworks and tools to write programs that can execute multiple instruction
streams. Common models include:
• Message Passing Interface (MPI): A standardized and portable message-passing system designed
to function on parallel computing architectures.
• OpenMP: An API that supports multi-platform shared-memory multiprocessing
programming in C, C++, and Fortran.
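As a minimal illustration of the OpenMP model, the following C sketch (compiled with, e.g., gcc -fopenmp) divides a loop's iterations among threads that share memory:

```c
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    /* OpenMP splits the iterations among threads sharing this memory;
       the reduction clause safely combines each thread's partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    printf("harmonic(%d) = %f\n", n, sum);
    return 0;
}
```

Each thread here is an independent instruction stream operating on its own slice of the iteration space, which is the shared-memory counterpart of the MPI model above.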
Pipelining
Pipelining is a technique where multiple instruction phases (such as fetch, decode, execute, memory
access, and write-back) are overlapped. This allows the next instruction to be fetched while the current
one is being decoded, and so on.
•Stages in a Pipeline: Typical stages include instruction fetch, instruction decode, execute, memory
access, and write-back.
•Pipeline Depth: Refers to the number of stages in the pipeline.
•Pipeline Hazards: Challenges that arise in pipelining, including:
•Data Hazards: Occur when instructions depend on the results of previous instructions.
•Control Hazards: Arise from branch instructions that change the flow of execution.
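Under the usual simplifying assumptions (one cycle per stage, no hazards), a pipeline with k stages completes n instructions in k + n - 1 cycles instead of the k·n cycles an unpipelined unit would need:

\[ \text{Speedup} = \frac{k \cdot n}{k + n - 1} \rightarrow k \quad \text{as } n \text{ grows.} \]

For example, with k = 5 stages and n = 100 instructions: 104 cycles versus 500, a speedup of about 4.8.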
Superscalar Execution
Superscalar execution involves a processor executing more than one
instruction during a single clock cycle by dispatching multiple
instructions to appropriate functional units in the CPU.
•Instruction-Level Parallelism (ILP): The degree to which instructions
can be executed in parallel.
•Functional Units: Independent units in the CPU capable of executing
operations (e.g., ALUs, FPUs).
•Dispatch Logic: Determines which instructions can be issued simultaneously without conflicts.
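A small C illustration of instruction-level parallelism (the CPU, not the programmer, exploits it at run time):

```c
#include <stdio.h>

/* The first two statements have no dependence on each other, so a
   superscalar CPU can dispatch them to separate functional units in the
   same cycle; the third depends on both results and must wait. */
static int f(int x, int y, int u, int v) {
    int a = x + y;   /* independent */
    int b = u * v;   /* independent: can issue alongside the addition */
    int c = a + b;   /* data dependence on a and b limits ILP here */
    return c;
}

int main(void) {
    printf("%d\n", f(1, 2, 3, 4)); /* (1+2) + (3*4) = 15 */
    return 0;
}
```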
Simultaneous Multithreading (SMT) / Hyper-Threading
• SMT (also known as Hyper-Threading in Intel processors) allows
multiple threads to issue instructions to the CPU's multiple functional
units in a single cycle. This improves utilization of CPU resources.
Very Long Instruction Word (VLIW)
• VLIW architectures pack multiple operations into a single long instruction
word. The compiler schedules instructions to ensure that they can be
executed in parallel, reducing the complexity of the CPU's control logic.
Speculative Execution
• Speculative execution involves the CPU guessing the path of branch
instructions and executing instructions ahead of time. If the guess is
correct, performance is improved; if not, the speculative results are
discarded.
• Branch Prediction: Techniques used to guess the outcome of a branch
instruction.
• Rollback Mechanisms: Allow the CPU to revert to a known good state
if speculation fails.
Advantages of Using Multiple Instruction Streams
1. Increased Performance:
• Parallel Execution: Multiple instruction streams allow for parallel execution, which can
significantly increase the throughput and overall performance of a system. This is
especially beneficial for applications that require high computational power.
• Reduced Execution Time: By executing multiple instructions simultaneously, the overall
execution time for tasks can be reduced.
2. Better Resource Utilization:
• Efficient Use of CPU Resources: Techniques like pipelining and superscalar execution
make better use of the CPU’s functional units, reducing idle time and increasing efficiency.
• Load Balancing: Multithreading and multicore processors can balance the workload
across different processing units, improving resource utilization.
3. Scalability:
• Scalable Performance: Systems designed with multiple instruction
streams can scale more easily to accommodate higher performance
demands by adding more processors or cores.
4. Enhanced System Responsiveness:
• Improved Multitasking: Systems that utilize multiple instruction
streams can handle multiple tasks simultaneously, leading to better
system responsiveness and user experience.
5. Energy Efficiency:
• Dynamic Adjustment: Some architectures can
dynamically adjust the execution of instruction streams
to balance performance and energy consumption,
leading to more energy-efficient operations.
Disadvantages of Using Multiple Instruction Streams
1. Increased Complexity:
• Hardware Complexity: Implementing multiple instruction streams
requires complex hardware designs, including sophisticated control
logic for handling dependencies, hazards, and synchronization.
• Software Complexity: Writing software that efficiently utilizes
multiple instruction streams can be challenging. Developers need
to manage concurrency, synchronization, and potential race
conditions.
2. Higher Costs:
• Development Costs: The design and development of processors
with multiple instruction streams are more expensive due to the
increased complexity.
• Power Consumption: While there can be energy efficiency
benefits, the overall power consumption of systems with multiple
instruction streams can be higher due to the additional hardware.
3. Diminishing Returns:
• Limited Parallelism: Not all applications can benefit from parallel
execution. Tasks that are inherently sequential may see little to
no performance improvement from multiple instruction streams.
• Amdahl’s Law: The overall speedup of a system is limited by the
portion of the task that cannot be parallelized. This law highlights
the diminishing returns of adding more parallel resources.
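Stated formally (with p the fraction of the work that can be parallelized and N the number of processors), Amdahl's Law is:

\[ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} \]

For example, with p = 0.95 and N = 8, S(8) = 1 / (0.05 + 0.11875) ≈ 5.9; even with unlimited processors the speedup can never exceed 1/(1 - p) = 20.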
Real World Example
• NVIDIA GPUs in Data Centers
• A major real-world example of a system that utilizes multiple instruction streams is
NVIDIA's GPUs (Graphics Processing Units) in data centers.
• Parallel Processing for AI and Machine Learning:
• NVIDIA GPUs are widely used in data centers to accelerate artificial
intelligence (AI) and machine learning workloads. These tasks often require
processing large amounts of data simultaneously, making GPUs ideal due to
their massive parallel processing capabilities.
• Example: Training a deep learning model involves processing numerous data
points in parallel. Each GPU core can handle an individual instruction stream,
allowing the model to be trained much faster compared to a traditional CPU.
• Graphics and Video Processing:
• Data centers that provide cloud gaming or video streaming services
rely on GPUs to render graphics and process video streams in parallel.
• Example: NVIDIA's own GeForce NOW cloud gaming service runs games on NVIDIA GPUs in its data centers. Each GPU can handle multiple game instances, each running its own instruction stream, providing a smooth gaming experience to users.
PARALLEL ARCHITECTURES
CONTENTS
• INTRODUCTION
• PIPELINE ARCHITECTURE
• ARRAY PROCESSOR
• MULTI-PROCESSOR ARCHITECTURE
• SYSTOLIC ARCHITECTURE
• DATAFLOW ARCHITECTURE
• CONCLUSION
INTRODUCTION
Parallel architectures enhance computational performance by executing tasks simultaneously, which is crucial for modern high-speed applications.
This document explores:
Pipeline Architecture: Sequential task stages for concurrent
processing.
Array Processor: Synchronized grids for repetitive computations.
Multi-Processor Architecture: Shared or distributed memory
systems for collaboration.
Systolic Architecture: Synchronized data flow for specialized tasks.
Dataflow Architecture: Data-driven execution for fine-grained
parallelism.
These approaches address diverse computational needs with unique strengths
and limitations.
PIPELINE ARCHITECTURE
Concept
Pipelining is a technique where multiple stages of a task are executed in parallel by different processing units. Each stage of the pipeline handles a different part of the task.
Stages
The processing of an instruction is divided into stages (fetch, decode, execute, memory access, write-back), each handled by a different pipeline unit.
ARRAY PROCESSOR
Structure
An array processor consists of an array of identical processing elements (PEs), interconnected via a network.
The PEs operate under the control of a single instruction stream, allowing for SIMD (Single
Instruction, Multiple Data) parallelism.
Operation
SIMD Execution: Each PE performs the same operation on different pieces of data in parallel,
useful for data-parallel tasks.
Control Unit: A central control unit broadcasts instructions to all PEs, which execute
them simultaneously.
Applications
Array processors suit data-parallel workloads such as image processing and scientific computing, where the same operation is applied across large data sets.

MULTI-PROCESSOR ARCHITECTURE
A multiprocessor system has more than one processor in close communication. All processors share a common bus, clock, memory, and peripheral devices. Multiprocessor systems are also called parallel systems or tightly coupled systems.
Features of Multiprocessor Systems: shared bus, clock, memory, and peripheral devices, with close communication between processors.
Multi-Processor Types
Advantages:
Scalability: Distributed memory systems can scale to a larger number of processors.
Performance: Increased performance through parallel processing.
Fault Tolerance: Redundancy enhances system reliability.
Disadvantages:
Complexity: Higher complexity in programming and managing communication between
processors.
Resource Contention: Potential for bottlenecks in shared memory systems.
SYSTOLIC ARCHITECTURE
Design Principles:
Rhythmic Computation: Processors compute and pass data in a rhythmic, synchronized manner, similar
to a heartbeat. This reduces the need for complex control logic and memory access.
Local Communication: Data is passed between neighboring processors, minimizing the need for
global communication and enhancing efficiency.
Applications:
Digital Signal Processing (DSP): Real-time processing of signals using Fast Fourier Transforms (FFT)
and convolution operations.
Medical Imaging: Accelerates image reconstruction in techniques like MRI and CT scans.
DATAFLOW ARCHITECTURE
Operation:
Data-Driven Execution: Instructions are executed as soon as all necessary input data
becomes available, enabling high levels of parallelism.
Tokens: Data tokens flow through the graph, triggering operations as they arrive at nodes.
MEMORY ORGANIZATION IN PARALLEL COMPUTING
1. Memory Models
• Shared Memory: Multiple threads or processors share a single memory space. Access coordination is necessary to prevent conflicts, typically using mechanisms like locks, semaphores, or atomic operations.
• Distributed Memory: Each processor has its private memory, and communication between processors occurs over a network using message-passing libraries such as MPI (Message Passing Interface).
2. Cache Coherency
• In shared-memory systems, multiple caches can hold copies of the same memory location. Cache coherency protocols ensure that all caches reflect the same value, preventing inconsistencies.
3. Synchronization
• Synchronization is critical when threads/processors access shared data to avoid race conditions. Common techniques include mutex locks, condition variables, and barriers.
4. Data Locality
• Spatial Locality: Accessing data elements stored near each other.
• Temporal Locality: Reusing recently accessed data.
• Optimizing for data locality minimizes memory access latency and improves performance.
5. Parallel Memory Access Strategies
• Partitioned Data Access: Dividing data among threads to minimize contention and enhance locality.
• Prefetching: Anticipating memory accesses and loading data into cache to reduce latency.
• Bulk Access: Transferring large chunks of data to reduce the overhead of individual memory operations.
6. Memory Alignment
• Proper memory alignment ensures that data is accessed efficiently, avoiding performance penalties due to unaligned access.
7. Memory Contention
• When multiple threads attempt to access the same memory location simultaneously, contention can occur, leading to performance degradation. Techniques such as locking mechanisms or partitioning data to reduce contention can mitigate this; a minimal locking sketch follows below.
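A minimal C sketch of mutex-based synchronization with POSIX threads (an illustration of items 3 and 7 above, not a specific system): without the lock, the two threads would race on the shared counter and lose updates.

```c
#include <stdio.h>
#include <pthread.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* serialize access to shared data */
        counter++;                   /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 200000 with the lock */
    return 0;
}
```

The lock also illustrates the synchronization overhead discussed earlier: the critical section serializes the two instruction streams.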
Classification Schemes
• There are multiple classification schemes for parallel computing, including:
• Flynn's classification
• A simple and widely used scheme that classifies parallel systems based on the
number of instruction and data streams. The four categories are:
• Single-instruction single-data (SISD): Equivalent to a sequential program
• Single-instruction multiple-data (SIMD): Similar to repeating the same operation
on a large data set
• Multiple-instruction single-data (MISD): Rarely used
• Multiple-instruction multiple-data (MIMD): The most common type of parallel
program
Shared vs. Distributed Memory
• Memory Access: shared memory gives all processors a single, unified memory space; in distributed memory, each processor has its private memory space.
• Communication: through shared variables in memory; in distributed memory, explicitly via message-passing.
• Latency: typically lower for local access in shared memory; in distributed memory it depends on network communication.
• Scalability: shared memory is limited by memory and bus contention; distributed memory scales well with the addition of processors and memory.
• Complexity: shared memory is easier to program and manage due to a global memory view; distributed memory requires explicit management of data distribution.
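To make the distributed-memory side concrete, here is a minimal MPI message-passing sketch in C; it assumes an MPI installation and two processes (run with, e.g., mpirun -np 2):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* explicit message passing: ranks share no memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

Note the contrast with the shared-memory column: no variable is visible to both processes, so every exchange is an explicit send/receive pair.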
Advantages & Disadvantages of Shared Memory
Advantages:
• Simplified Programming
• Efficient Communication
• Easier Debugging
• Low Latency
• Synchronization Support
Disadvantages:
• Limited Scalability
• Memory Bottleneck
• Synchronization Overhead
• Limited Fault Tolerance
• Hardware Cost
Advantages & Disadvantages of Distributed Memory
Advantages:
• Scalability
• Fault Tolerance
• No Contention
• High Performance
• Cost-Effective Scaling
Disadvantages:
• Complex Programming
• Higher Latency
• Communication Overhead
• Debugging Challenges
• Synchronization Complexity
Symmetric Multiprocessing (SMP)
• Symmetric Multiprocessing (SMP) is a multiprocessing architecture
where multiple processors share a single memory space and work
independently on different tasks. Each processor has equal access to the
memory and I/O resources, allowing for efficient parallel processing. SMP
systems improve performance and scalability by distributing workload
across processors, but they can face challenges such as memory
bottlenecks and cache coherency issues. This architecture is commonly
used in servers and high-performance computing systems.
Characteristics
• Multiple Processors: SMP systems have multiple processors (also known as CPUs), all of
which share access to the same memory and I/O resources.
• Shared Memory Architecture: In SMP, all processors share a single physical memory
space, which means that all CPUs can access any part of the memory.
• Equal Access to I/O Devices: All processors have equal access to input/output devices,
which allows for more efficient handling of peripheral devices and system resources.
• Single OS Instance: SMP systems typically run a single operating system instance, which
is responsible for managing all the processors and ensuring that tasks are divided among the
processors efficiently.
• Synchronization: The processors in an SMP system often require synchronization
mechanisms such as locks, semaphores, and barriers to ensure they do not conflict with
each other when accessing shared resources like memory.
Advantages of SMP
• Improved Performance: With multiple processors, tasks can be split and executed concurrently,
leading to a significant increase in processing speed and overall system throughput.
• Scalability: SMP systems are highly scalable as additional processors can be added to the system
to improve performance. This scalability is particularly useful for applications that require
substantial computing power.
• Single Shared Memory: Since all processors have access to the same memory, data sharing
between processors becomes easy, and there’s no need for complex memory management
techniques to transfer data between different memory spaces.
• Cost-Effective: SMP is generally more cost-effective than other forms of multiprocessing (such as
massively parallel processing), as it can use commercially available processors and hardware.
• Simplicity of Design: SMP systems typically have simpler designs because of their shared memory
architecture, which makes them easier to manage and program compared to other more complex
multiprocessor systems like distributed systems or NUMA.
Limitations of SMP
• Memory Bottleneck: Although all processors share memory, this shared access can become a
bottleneck in large-scale systems where multiple processors simultaneously access memory. The
system's performance can degrade due to contention for memory resources.
• Limited Scalability: While SMP systems can scale by adding more processors, there is a
practical limit to how many processors can be added before the performance gains diminish. This
is due to issues like memory bandwidth limitations, cache coherency problems, and system
overhead.
• Cache Coherency Issues: SMP systems need mechanisms to maintain cache coherence, i.e.,
ensuring that when one processor modifies a piece of data, other processors see the updated value.
This requires additional hardware or software support, which can introduce complexity and
reduce performance.
• Cost: Although SMP can be cost-effective for small to mid-sized systems, scaling the system with
more processors often results in increased costs due to the need for specialized hardware and
high-bandwidth interconnects.
SIMD (Single Instruction, Multiple Data):
Concept
• SIMD is a parallel computing architecture where a single instruction is
executed on multiple data elements simultaneously. It is a type of data-level parallelism, where multiple processing elements perform the same
operation on different pieces of data concurrently, thus speeding up the
execution of tasks that involve processing large amounts of data.
• In a SIMD system, one instruction operates on several data elements in
parallel, making it highly efficient for operations that require the same
computation to be applied to a large set of data, such as vector and matrix
operations.
Operation of SIMD
• Instruction Set: The key characteristic of SIMD is that one instruction is issued by
the processor, but this instruction is applied to multiple data elements in parallel. This
is different from SISD (Single Instruction, Single Data), where the instruction is
applied to only one data element at a time.
• Parallelism: SIMD works by executing a single instruction on multiple data elements
at the same time. For example, if we need to perform an addition operation on two
arrays, SIMD can add corresponding elements from both arrays simultaneously, thus
reducing the time required for processing.
• Vector Processing: SIMD often uses vector processing, where data is organized in
vectors (arrays or lists of values). Processors equipped with SIMD capabilities
typically have vector registers that hold multiple data values (e.g., 4, 8, or more
elements) and apply the same operation to all the elements in parallel.
• SIMD Registers: The processors supporting SIMD typically have special registers
(e.g., 128-bit, 256-bit, or 512-bit wide registers) that can hold multiple data elements.
Example of SIMD Operation: Given two arrays
A = [a1, a2, a3, a4]
B = [b1, b2, b3, b4]
SIMD can perform the addition operation as:
C = [a1+b1, a2+b2, a3+b3, a4+b4]
All four elements are processed in parallel with the same addition instruction.
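The same element-wise addition can be written with x86 SSE intrinsics in C (a minimal sketch, assuming an x86 CPU with SSE support):

```c
#include <stdio.h>
#include <immintrin.h>  /* x86 SSE/AVX intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* ONE instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]); /* 11 22 33 44 */
    return 0;
}
```

The single `_mm_add_ps` call is exactly the "one instruction, multiple data elements" behavior described above.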
Applications of SIMD
1. Cryptography
•Encryption and Hashing: SIMD can accelerate encryption and decryption algorithms, such as
AES (Advanced Encryption Standard) or hashing algorithms like SHA, by applying the same
cryptographic transformation to multiple data blocks in parallel.
2. Machine Learning
•Neural Networks: SIMD can be applied in training and inference of machine learning models,
especially in tasks like matrix multiplications, dot products, and activation functions, which can
be performed on multiple data elements concurrently.
•Data Preprocessing: SIMD can speed up data preprocessing tasks like normalization, feature
scaling, and other operations that involve applying the same transformation across a large
dataset.
3. High-Performance Computing (HPC)
•Parallel Algorithms: SIMD is used in supercomputers and clusters to optimize the
execution of parallel algorithms, especially in tasks that involve large-scale data processing
like climate modeling, weather forecasting, and genomics research.
•Parallel Databases: SIMD can be used in parallel database systems to accelerate operations
such as filtering, sorting, and aggregating large datasets.
4. Multimedia Processing
•Image and Video Processing: SIMD is ideal for tasks like image filtering, transformation,
and compression, where each pixel in an image or video frame requires the same operation
(e.g., applying a filter, edge detection, or color transformation).
Vector Processing: Principles
Vector processing refers to a form of computing where single instructions are applied to
vectors (arrays or sequences of data elements) rather than individual scalar data points. The
primary goal of vector processing is to exploit the inherent parallelism in certain types of
problems to increase computational efficiency.
Key Principles:
•Vector Registers: Vector processors use specialized registers to store multiple data elements
(such as integers or floating-point numbers) simultaneously. These registers can hold several
values at once, unlike scalar processors, which process one data element at a time.
•Single Instruction Multiple Data (SIMD): Vector processing often employs SIMD
architecture, where one instruction is applied to multiple data elements stored in vector
registers. This approach allows for parallel execution of the same operation across several
data elements simultaneously.
•Vector Operations: A vector processor performs operations like addition, subtraction,
multiplication, and division on entire vectors, significantly improving computational
efficiency for large datasets.
Techniques in Vector Processing
• Vectorization: This technique refers to the process of converting scalar operations into
vector operations. For example, instead of performing a series of scalar multiplications,
vectorization would allow all multiplications to be carried out simultaneously on
corresponding elements of two vectors.
• Vector Length: The vector length determines how many data elements a vector register can
hold. The longer the vector length, the more data can be processed in parallel, which
increases throughput.
• Vector Instructions: Vector processors execute vector instructions that operate on entire
vectors. These instructions are more efficient than scalar instructions because they allow
multiple data points to be processed in a single cycle. Examples of vector instructions
include:
Vector Addition: Adding corresponding elements of two vectors.
Vector Dot Product: Computing the sum of the products of corresponding elements from
two vectors.
Vector Scaling: Multiplying each element of a vector by a constant.
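As an illustration of the dot product above, here is a C sketch using the OpenMP `simd` pragma as a portable vectorization hint (build with, e.g., gcc -O2 -fopenmp-simd; whether the compiler actually emits vector instructions depends on the target):

```c
#include <stdio.h>

/* Vectorized dot product: the simd pragma asks the compiler to process
   several elements per iteration in vector registers, combining the
   lane-wise partial sums via the reduction clause. */
static float dot(const float *x, const float *y, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8] = {2, 2, 2, 2, 2, 2, 2, 2};
    printf("%.1f\n", dot(x, y, 8)); /* 2*(1+...+8) = 72.0 */
    return 0;
}
```

This is vectorization in the sense defined above: a scalar multiply-add loop is converted into operations over whole vector registers.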
Applications of Vector Processing
1. Image and Signal Processing
•Image Manipulation and Filtering: Vector processors can process image data in parallel,
applying operations like edge detection, smoothing, and transformations to multiple pixels
simultaneously.
•Signal Processing: In signal processing, tasks such as filtering, Fourier transforms (FFT), and
convolution operations can be efficiently implemented on vector processors.
•Audio and Video Processing: Vector processing is widely used in audio and video encoding,
decoding, and compression algorithms (e.g., MPEG or JPEG). Operations such as motion
compensation, DCT (Discrete Cosine Transform), and color space transformations benefit
from vectorization.
2. Machine Learning and Deep Learning
•Neural Network Computations: Vector processors can speed up the matrix and vector
operations commonly found in machine learning algorithms, especially in deep learning
models such as neural networks.
Applications of Vector Processing
• Linear Algebra: Many machine learning algorithms involve solving large systems of
linear equations or performing vector and matrix multiplication, tasks that can be
efficiently handled by vector processors.
3.Cryptography and Security
• Encryption and Decryption: Vector processors accelerate cryptographic algorithms
(e.g., AES, RSA) that operate on large blocks of data. These operations often involve
repetitive tasks that are highly parallelizable, making them ideal candidates for vector
processing.
• Hashing: Functions like SHA-256 and MD5, commonly used in data integrity checks,
can be optimized using vector processing by applying the same operations on multiple
data chunks simultaneously.
GPU Co-Processing: Role of GPUs in Accelerating Parallel Computations
Introduction to GPU Co-Processing
• GPU (Graphics Processing Unit) co-processing refers to the use of GPUs in tandem with
traditional CPUs to accelerate computational tasks, particularly those that can be parallelized.
GPUs are highly specialized hardware designed for parallel processing, making them well-
suited for tasks that involve large-scale, repetitive computations. While CPUs are optimized
for single-threaded tasks with higher clock speeds and better general-purpose performance,
GPUs are optimized for handling thousands of simultaneous threads, excelling in parallel
processing tasks.
•Key Distinction:
•CPU vs. GPU:
• CPUs: Designed for sequential tasks with a few powerful cores optimized for high
single-thread performance.
• GPUs: Designed for massive parallel processing with many smaller, specialized cores
that excel at handling numerous threads simultaneously.
Role of GPUs in Parallel Computations
Parallel Architecture of GPUs
•Many Cores: GPUs consist of hundreds to thousands of smaller cores (CUDA cores in NVIDIA GPUs
or Stream Processors in AMD GPUs), enabling them to handle large numbers of threads in parallel.
•SIMD Architecture: Similar to vector processing, GPUs follow a Single Instruction, Multiple Data
(SIMD) model, applying the same instruction to multiple data points in parallel. This makes GPUs
highly efficient for tasks like matrix multiplication, image processing, and deep learning.
Massive Thread Management
•Thread Blocks and Grids: In GPU programming (using CUDA or OpenCL, for example), threads are
organized into blocks, and these blocks are arranged in grids. This structure enables the GPU to manage
and execute thousands or even millions of threads simultaneously. The threads within a block can share
data efficiently, and blocks are scheduled for execution across the GPU's cores.
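The arithmetic behind this thread-to-data mapping is simple; the following plain-C sketch mirrors the CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x` for computing a thread's global index (illustrative only, not GPU code):

```c
#include <stdio.h>

/* How a GPU derives a thread's global data index from its position in
   the grid (in CUDA: blockIdx.x * blockDim.x + threadIdx.x). */
static int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

int main(void) {
    /* thread 5 of block 2, with 256 threads per block, handles element 517 */
    printf("%d\n", global_index(2, 256, 5));
    return 0;
}
```

Each thread uses this index to select the data element it works on, which is how thousands of threads cover a large array without overlap.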
Memory Hierarchy
•Global, Shared, and Local Memory: GPUs have multiple types of memory:
• Global Memory: Accessible by all threads but relatively slower.
• Shared Memory: Faster, but only accessible within a thread block, allowing for faster
communication between threads in the same block.
• Local Memory: Each thread has its own local memory, used for private storage.
•The memory hierarchy in GPUs allows for high throughput and parallel execution
but requires careful management to maximize performance.
Memory Issues and Flynn's Taxonomy
Memory Issues in Parallel Computing
Cache Coherence
Cache coherence refers to the consistency of data stored in local caches of a
multiprocessor system. When multiple processors have their own caches, they may
store copies of the same memory location. If one processor updates its cache, other
caches must be updated or invalidated to maintain coherence. Two primary protocols
for maintaining cache coherence are:
Snoopy Bus Protocols: Each cache controller monitors bus transactions to determine
if it needs to update or invalidate its copy of a cache line.
Directory-Based Protocols: A centralized directory keeps track of which caches hold
copies of each memory block, managing permissions for access and modifications.
These protocols help prevent inconsistencies and ensure that all processors see the most recent value of shared variables.
Memory Contention
Memory contention occurs when multiple processors attempt to access the same
memory location simultaneously. This can lead to delays and reduced performance.
Techniques to mitigate contention include:
• Replication: Storing multiple copies of frequently accessed data in separate caches
can reduce latency and improve access times.
• Data Locality: Keeping data close to the processor that uses it most often can
significantly enhance performance by reducing access times
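A related pitfall, illustrated below, is false sharing: two threads repeatedly write distinct variables that happen to sit on the same cache line, so the coherence protocol bounces the line between their caches. A minimal C sketch with POSIX threads, assuming a typical 64-byte cache line (an x86 assumption):

```c
#include <stdio.h>
#include <pthread.h>

/* Padding places each counter on its own cache line (64 bytes assumed),
   so the two threads do not contend for the same line. Removing the pad
   would make them "falsely share" one line and run noticeably slower. */
struct padded { long value; char pad[64 - sizeof(long)]; };
static struct padded counters[2];

static void *work(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].value++;   /* each thread touches only its own line */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    pthread_create(&t[0], NULL, work, &ids[0]);
    pthread_create(&t[1], NULL, work, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```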
Data Locality
Data locality refers to the tendency of programs to access a relatively small portion of
their address space at any given time. Optimizing for data locality can improve
performance by:
• Temporal Locality: Reusing recently accessed data.
• Spatial Locality: Accessing data locations that are close together in memory.
Optimizing for data locality significantly reduces memory access times; a short traversal-order sketch follows below.
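A short C sketch of spatial locality: row-major traversal of a 2-D array touches adjacent memory, while swapping the two loops would stride through memory and typically run much slower (the array size here is an arbitrary illustration):

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive accesses touch adjacent addresses
   (spatial locality), so most of them hit in cache. Swapping the i and j
   loops makes every access jump N*sizeof(double) bytes. */
static double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%.0f\n", sum_rows());  /* N*N = 1048576 */
    return 0;
}
```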
Multiprocessor Caches