MODULE-1
Parallel computing is used in applications such as scientific simulations.
Types of Parallelism
a. Data Parallelism
• The same operation is performed on different parts of the data simultaneously.
Example: applying a filter to every pixel in an image (a short code sketch follows this list).
b. Task Parallelism
• Different tasks or functions are executed at the same time, possibly on different data.
Example: a computer playing music while downloading a file and running a virus scan.
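As a small, hedged illustration of data parallelism (the function and array names are invented
for this sketch and do not belong to any particular API), brightening an image can be split so
that every worker applies the same operation to its own chunk of pixels:

/* Data parallelism: the same operation is applied to different parts of
   the data. Each of p workers handles one slice of an n-pixel image.    */
void brighten_chunk(unsigned char pixels[], int first, int last) {
    for (int i = first; i < last; i++)
        if (pixels[i] <= 245)
            pixels[i] += 10;          /* same operation on every pixel */
}

/* Worker k would call: brighten_chunk(pixels, k * n / p, (k + 1) * n / p); */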
In modern computing, hardware that enables parallel execution plays a significant role in
enhancing performance. Although multiple issue and pipelining allow multiple operations
within a processor to be executed in parallel, these mechanisms are not directly observable
or controllable by programmers. For the purpose of parallel programming, we define parallel
hardware as that which is visible to the programmer and whose capabilities can be
exploited or must be adapted to through source code changes.
Parallel computers can be classified in two distinct ways: by how they handle instruction and
data streams, and by how their memory is organized (discussed later). One of the most
commonly used schemes of the first kind is Flynn’s taxonomy, which categorizes systems based
on the number of instruction streams and data streams they handle simultaneously.
A traditional system based on the von Neumann architecture is called a SISD system—
Single Instruction Stream, Single Data Stream—since it processes one instruction and one
data item at a time. In contrast, parallel systems can either follow the SIMD model (Single
Instruction, Multiple Data) or the MIMD model (Multiple Instruction, Multiple Data).
SIMD systems apply the same instruction to multiple data items concurrently, whereas MIMD
systems allow independent instruction streams operating on different data.
Consider the task of vector addition, where two arrays x and y of length n need to be added
element-wise:
for (i = 0; i < n; i++)
    x[i] += y[i];
If there are n datapaths, all elements can be processed in one cycle. If there are fewer
datapaths, say m, the operation is performed in blocks of m elements. With m = 4 and n =
15, elements are added in four separate stages: 0–3, 4–7, 8–11, and 12–14. In the final group,
one datapath remains idle, illustrating underutilization due to unequal division.
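SIMD execution also becomes less efficient when the loop body contains branches. As an
illustration (a hedged sketch; the positivity test is just a common textbook condition), suppose
the addition is performed only when the corresponding element of y is positive:

for (i = 0; i < n; i++)
    if (y[i] > 0.0)
        x[i] += y[i];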
Here, branching logic forces some datapaths to idle while others execute, reducing
efficiency. Also, traditional SIMD systems operate synchronously, meaning all datapaths
must wait for the next instruction to be broadcast. They do not store instructions, so they
cannot defer execution.
Nevertheless, SIMD is highly effective for processing large, uniform datasets, such as image
pixels or signal samples. It excels when the same instruction must be applied to many data
points, making it useful for graphics, matrix operations, and scientific computing. Over
time, SIMD systems have evolved. Initially, companies like Thinking Machines pioneered
their use in supercomputing. Later, their prominence declined, leaving vector processors
as the most notable SIMD representatives. Today, GPUs and desktop CPUs often integrate
SIMD features to accelerate multimedia and numeric processing.
Vector Processors
Vector processors are computing systems designed to perform operations on entire arrays
or vectors of data simultaneously. In contrast, traditional CPUs process one data element at
a time, known as scalar processing. Vector processors are especially effective in applications
that involve performing the same operation repeatedly across large datasets.
A central component of a vector processor is the vector register, which holds multiple data
elements and allows simultaneous operations on all elements. The vector length is fixed by
the architecture and typically ranges from 4 to 256 elements, each being 64 bits.
Example snippet:
for (i = 0; i < n; i++)
    x[i] += y[i];
In a vector processor, this loop can be executed with one vector load, one vector add, and
one vector store per block of vector_length elements.
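As a rough sketch of what this block-wise execution looks like (assuming a hypothetical vector
length VLEN; a real vectorizing compiler generates this strip-mined form automatically), the
scalar loop can be rewritten as:

#define VLEN 64                                  /* hypothetical vector register length */

for (i = 0; i < n; i += VLEN) {
    int len = (n - i < VLEN) ? n - i : VLEN;     /* last block may be shorter */
    /* conceptually: one vector load of x[i..i+len), one of y[i..i+len),
       one vector add, and one vector store                               */
    for (j = 0; j < len; j++)
        x[i + j] += y[i + j];
}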
Vector processors offer high performance and ease of use for regular data patterns.
Vectorizing compilers can automatically detect loops that can be transformed into vector
operations and provide feedback on non-vectorizable code. These systems utilize high
memory bandwidth, ensuring efficient data use without unnecessary memory fetches.
However, they are less suitable for irregular data structures. Additionally, there's a practical
limit to how much vector processors can scale by increasing the vector length. Modern
architectures overcome this by increasing the number of vector units, not the length.
Commodity systems today offer support for short-vector operations, but long-vector
processors are custom-built and expensive, limiting their widespread deployment.
Graphics Processing Units (GPUs)
Shader functions are inherently parallel and apply uniformly to thousands of elements like
vertices or fragments. Since similar elements tend to follow the same logic path, SIMD
parallelism is used extensively. Each GPU core may contain dozens of datapaths, enabling
massively parallel execution.
A single image may involve hundreds of megabytes of data, so GPUs are built for high-
throughput memory access. They avoid delays using hardware multithreading, which
allows the state of hundreds of threads to be stored and quickly swapped. Thread count per
core depends on the resources consumed by each shader, such as the number of registers
required.
While GPUs perform exceptionally well on large workloads, they struggle with small tasks
due to overhead and underutilized resources. Importantly, GPUs are not pure SIMD machines.
Although their internal datapaths operate in SIMD fashion, multiple instruction streams can
run on a single GPU, making them resemble hybrid systems with both SIMD and MIMD
characteristics.
GPUs may use shared memory, distributed memory, or a combination of both. For example,
multiple cores can access a common memory block, while other cores use a different
memory region. Inter-core communication may involve networked connections. However,
in typical programming models, GPUs are treated as shared-memory systems.
Over time, GPUs have become popular beyond graphics. Their architecture supports high-
performance computing (HPC) tasks, including machine learning, scientific simulations,
and data analysis.
Example:
If a shader function operates on 1000 pixels, and each GPU core can process 128
elements simultaneously, then only ~8 instruction steps are needed instead of 1000.
Programming APIs such as CUDA, OpenCL, and SYCL have been developed to harness the
computing potential of GPUs.
MIMD Systems
MIMD (Multiple Instruction, Multiple Data) systems consist of fully independent processing
units, each with its own control unit and its own datapath.
Key Characteristics:
1. Asynchronous Execution:
o Processors operate independently without needing to synchronize their
clocks.
o Each processor may execute at its own pace, making them asynchronous
systems.
2. No Global Clock or Lockstep Execution:
o Unlike SIMD systems, there is no need for a global clock or for processors to
execute in lockstep (simultaneously on the same instruction).
3. Independent Control Units:
o Each processor has its own control unit, meaning they can run different
programs independently.
4. Scalability:
o MIMD systems scale well with increasing processors, making them suitable for
large-scale computation tasks.
Shared-Memory Systems
Shared-memory systems are computer systems where multiple processors (CPUs or cores)
access and share a common main memory. This model allows inter-process
communication via shared variables, avoiding the need for explicit message passing.
Advantages:
• Simplified programming model.
• Memory access is consistent across processors.
Limitations:
• Can become a bottleneck as the number of cores increases.
• Limited scalability.
Shared-memory systems are usually classified as UMA (Uniform Memory Access) systems,
where every core reaches main memory with the same latency, or NUMA (Non-Uniform
Memory Access) systems, where each core can access the memory attached directly to it faster
than memory attached to other cores.
Advantages of NUMA:
• Faster access to local memory.
• More scalable than UMA.
• Potential to use larger memory spaces.
Feature                  UMA                             NUMA
Memory Access Time       Uniform (same for all cores)    Varies (local vs. remote)
Distributed-Memory Systems
Distributed-memory systems are computer architectures in which each processor or node
has its own local memory, and processors communicate by passing messages over a
network.
Cluster:
• A cluster is the most common form of distributed-memory system.
• It is made up of multiple commodity systems (like standard PCs).
• These systems are connected via a commodity interconnection network, such as
Ethernet.
• Each individual computer in the cluster is known as a node.
Nodes:
• Nodes are the computational units in the system.
• In modern systems, each node is often a shared-memory system (e.g., multicore
processor).
Grid Computing:
• The grid infrastructure connects geographically distributed computers into a single
distributed-memory system.
• Grids are typically heterogeneous, meaning that:
o The nodes may be built from different types of hardware.
o Software and operating systems may also vary.
Advantages:
• Excellent for parallel processing at scale.
• Nodes can be added easily to increase computational power.
• Fault tolerance is improved due to distributed nature.
Disadvantages:
• Programming is complex due to manual handling of communication.
• Latency and bandwidth limitations can affect performance.
• Synchronization and data consistency across nodes must be managed carefully.
Interconnection Networks
The interconnect plays a decisive role in the performance of both distributed- and shared-
memory systems: even if the processors and memory have virtually unlimited performance,
a slow interconnect will seriously degrade the overall performance of all but the simplest
parallel program.
Although some of the interconnects have a great deal in common, there are enough differences
to make it worthwhile to treat interconnects for shared-memory and distributed-memory
separately.
Shared-Memory Interconnects
Currently, the two most widely used interconnects on shared-memory systems are:
1. Buses
2. Crossbars
A bus is a collection of parallel communication wires together with some hardware that
controls access to the bus. The key characteristic of a bus is that the communication wires are
shared by the devices that are connected to it.
Buses have the virtue of low cost and flexibility: multiple devices can be connected to a bus
with little additional cost. However, since the communication wires are shared, as the number
of devices connected to the bus increases, the likelihood of contention increases, and the
expected performance decreases.
Therefore, if we connect a large number of processors to a bus, we would expect that the
processors would frequently have to wait for access to main memory. Thus, as the size of
shared-memory systems increases, buses are rapidly being replaced by switched
interconnects.
As the name suggests, switched interconnects use switches to control the routing of data
among the connected devices.
A crossbar is illustrated in Figure 2.7(a).
• The lines are bidirectional communication links,
• The squares are cores or memory modules,
• The circles are switches.
Distinct communications can take place at the same time; a conflict arises only if two cores
attempt to simultaneously access the same memory module. For example, Figure 2.7(c) shows
the configuration of the switches if:
• P1 writes to M4
• P2 reads from M3
• P3 reads from M1
• P4 writes to M2
Crossbars allow simultaneous communication among different devices, so they are much
faster than buses. However, the cost of the switches and links is relatively high. A small bus-
based system will be much less expensive than a crossbar-based system of the same size.
Distributed-Memory Interconnects
Distributed-memory interconnects are often divided into two groups: direct interconnects, in
which each switch is directly connected to a processor-memory pair (rings, toroidal meshes,
and hypercubes are examples), and indirect interconnects, in which the switches may not be
directly connected to a processor.
A key measure of communication capability is the bisection width, which refers to the
minimum number of links that must be removed to divide the system into two equal halves.
For a ring with 8 nodes, the bisection width is 2, while in a square toroidal mesh with p =
q² (where q is even), the bisection width is 2√p.
Another important metric is the bisection bandwidth, which is the sum of the bandwidths
of the links connecting two halves of the system. For example, if each link in a ring has a
bandwidth of 1 Gbps, the total bisection bandwidth would be 2 Gbps. This metric gives a
good idea of the data transfer capacity across the network.
An ideal but impractical network is the fully connected network, where each switch is
directly connected to every other switch. Its bisection width is p² / 4, but it requires p(p −
1)/2 links, making it unfeasible for large systems. It serves as a theoretical best-case model.
The hypercube is another direct interconnect used in actual systems. It is built inductively—
a 1D hypercube has 2 nodes, and each higher-dimensional hypercube is built by joining two
lower-dimensional ones. A hypercube of dimension d has p = 2ᵈ nodes, and each switch
connects to d other switches. The bisection width is p / 2, offering higher connectivity than
rings or meshes. However, the switches are more complex, requiring log₂(p) connections,
making them more expensive than mesh-based switches.
In contrast, indirect interconnects separate the switches from direct processor connections
and route communications through a switching network. Two popular examples are the
crossbar and the omega network. In a crossbar, all processors can communicate
simultaneously with different destinations as long as there is no conflict. Its bisection width
is p, and it offers high flexibility but is very costly, requiring p² switches.
The omega network is a more cost-efficient design using 2×2 crossbar switches arranged
in stages. It allows for some parallel communications but has contention—certain
communications cannot occur at the same time. Its bisection width is p / 2, and it uses only
p × log₂(p) switches, making it significantly cheaper than a full crossbar.
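The formulas above are easy to tabulate. Below is a small, hedged C sketch (p = 16 is just an
illustrative size; it assumes p is an even perfect square for the torus and a power of two for the
hypercube) that prints the bisection widths and link counts quoted in this section:

#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 16;                                  /* illustrative number of nodes */
    int q = (int)sqrt((double)p);                /* torus side length, p = q*q   */
    int d = (int)round(log2((double)p));         /* hypercube dimension, p = 2^d */

    printf("ring:            bisection width = 2\n");
    printf("toroidal mesh:   bisection width = %d  (2*sqrt(p))\n", 2 * q);
    printf("fully connected: bisection width = %d, links = %d\n",
           p * p / 4, p * (p - 1) / 2);
    printf("hypercube:       bisection width = %d  (p/2), dimension d = %d\n", p / 2, d);
    printf("crossbar:        bisection width = %d  (p)\n", p);
    printf("omega network:   bisection width = %d  (p/2)\n", p / 2);
    return 0;
}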
Cache Coherence
Suppose x is a shared variable initialized to 2 and cached by both cores, and that Core 0
executes x = 7 before Core 1 executes z1 = 4 * x + 8 (a timeline sketch follows the list below).
• Even though x = 7 is executed before z1 = 4 * x + 8 uses x, Core 1 may still use the old
cached value x = 2.
• This leads to incorrect value in z1.
o Expected: z1 = 4 * 7 + 8 = 36
o Actual (due to stale cache): z1 = 4 * 2 + 8 = 16
• Even after Core 0 updates x, unless x is evicted and reloaded in Core 1’s cache, Core 1
continues using the old value.
• This happens regardless of write policy:
o Write-through: Updates go to main memory, but not to other caches.
o Write-back: Update stays only in Core 0’s cache, not visible to others.
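To make the sequence concrete, here is one possible timeline (a sketch, assuming both cores
already hold x == 2 in their caches at time 0):

Time   Core 0                             Core 1
0      x = 7;  (updates its own cache)    (other work; cache still holds x == 2)
1      (other work)                       z1 = 4 * x + 8;  (may read the stale x == 2, giving 16)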
Caches designed for single-processor systems do not ensure coherence when multiple
processors cache the same variable.
• There’s no mechanism to ensure that an update by one core is seen by others.
• This results in unpredictable behavior in shared-memory programs.
The cache coherence problem arises when multiple caches hold the same variable, but
an update by one processor is not reflected in others' caches.
• This leads to inconsistent views of memory.
• Programs cannot rely on hardware caches to behave consistently across cores.
In shared-memory multiprocessor systems, each processor core has its own private cache.
When multiple cores cache and access the same memory location, inconsistencies can arise
due to updates not being reflected across all caches. This leads to the cache coherence
problem.
To handle cache coherence, two main approaches are used: Snooping cache coherence and
Directory-based cache coherence.
Snooping cache coherence is based on the principle used in bus-based systems. All cores share
a common bus, and any communication on the bus can be observed by all cores. When Core 0
updates a shared variable x in its cache, it broadcasts the update on the bus. If Core 1 is
observing the bus, it can detect the update and invalidate its own cached copy of the variable.
The broadcast indicates that the cache line containing x has been updated, but not the value
of x itself.
Snooping does not require the interconnect to be a bus, but it must support broadcast. It works
with both write-through and write-back cache policies. In write-through, updates are
immediately written to memory, which other cores can observe. In write-back, updates
remain in the local cache until evicted, so additional messages are required to notify other
cores.
Snooping cache coherence does not scale well to large systems because it requires
broadcasting every time a variable is updated. As the number of cores increases, the
communication overhead becomes a bottleneck.
Directory-based cache coherence is suitable for larger systems. These systems support a
single address space, and a core can access a variable in another core's memory by direct
reference. In this approach, a data structure called a directory is used to keep track of which
cores have a copy of each cache line.
The directory is usually distributed, with each core or memory module maintaining the status
of its local memory blocks. When a cache line is read by a core, the directory entry is updated
to reflect that the core has a copy. When a write occurs, the directory is checked to identify all
cores holding the copy, and only those are notified to invalidate or update their cache lines.
Directory-based coherence avoids global broadcasts and scales better in systems with many
cores. However, it requires additional storage and maintenance for the directory structure.
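As a purely illustrative sketch (not the format of any particular machine), a distributed
directory might keep one entry per cache line of its local memory, recording the line's state
and which cores currently hold copies:

#include <stdint.h>

typedef enum { UNCACHED, SHARED, MODIFIED } line_state_t;

typedef struct {
    line_state_t state;    /* current state of this cache line            */
    uint64_t     sharers;  /* bit i set => core i has a copy of the line  */
} dir_entry_t;

/* On a write by core w, the owning module checks the entry, sends
   invalidations only to the cores whose bits are set in sharers, and then
   records state = MODIFIED, sharers = (1ULL << w).                       */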
False sharing
False sharing occurs when CPU caches operate on cache lines (not on individual variables),
and multiple cores update variables that lie on the same cache line, even if they are
logically independent.
This results in unnecessary cache invalidations and causes performance degradation.
Example: suppose y is a shared array of doubles and the m iterations of the outer loop are
split between the cores, with
    iter_count = m / core_count;
Core 0 executes:
    for (i = 0; i < iter_count; i++)
        for (j = 0; j < n; j++)
            y[i] += f(i, j);
Core 1 executes:
    for (i = iter_count; i < 2 * iter_count; i++)
        for (j = 0; j < n; j++)
            y[i] += f(i, j);
Problem:
• Assume core_count = 2 (two cores) and that a double occupies 8 bytes; a worked check of
these numbers follows this list.
• Cache line size = 64 bytes, and y[0] starts at beginning of a cache line.
• When two cores simultaneously execute their sections, they access different
elements of y[], but these elements are within the same cache line.
• So each core’s cache gets invalidated, and they must fetch updated lines from
memory.
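A quick worked check of these numbers (assuming, say, m = 8 so that each core handles four
elements):

64 bytes per cache line / 8 bytes per double = 8 doubles per line,
so y[0] through y[7] all sit in a single cache line. Core 0 updates y[0]..y[3] and Core 1 updates
y[4]..y[7]: different elements, but the same line, so every update invalidates the other core's
copy of that line.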
Effect:
• This creates high memory traffic and poor performance even though no true data
sharing exists.
• Known as false sharing, this issue occurs due to the way data is stored in memory,
not due to program logic.
Distributed-Memory Systems:
• Each processor has its own local memory.
• Communication happens via explicit message passing.
• Requires more effort to program, but hardware scales better:
o Hypercube and toroidal mesh interconnects are relatively inexpensive.
o Can support thousands of processors.
• Well-suited for problems involving large-scale data or high computation needs.
Shared-Memory Systems                          Distributed-Memory Systems
Processors share a single memory space.        Each processor has its own local memory.
Crossbars support more processors but          Supports thousands of processors
are very expensive.                            cost-effectively.
1.2 Parallel Software
Parallel hardware is now widely available. Most desktops and servers today are built with
multicore processors, allowing them to perform multiple tasks at once. However, parallel
software has not advanced at the same pace. Apart from specialized systems such as
operating systems, database systems, and web servers, most commonly used software still
runs in a sequential manner and does not take full advantage of parallel hardware.
This mismatch is a concern. In the past, application performance improved steadily with
better hardware and smarter compilers. Today, performance gains depend on how well
software uses parallelism. To keep improving the speed and power of applications,
developers must learn to write software for shared- and distributed-memory
architectures.
Before diving into programming techniques, we must understand some basic terms. In
shared-memory programming, a single process creates multiple threads. These threads
work together and share memory, so we say threads carry out tasks. In distributed-
memory programming, multiple processes are created, each with its own memory. These
processes work independently and communicate by sending messages. In this book,
whenever the topic applies to both models, we use the combined term “processes/threads”.
1.2.1 Caveats:
This section only discusses software for MIMD (Multiple Instruction, Multiple Data)
systems. It does not cover GPU programming, because GPUs use different programming
interfaces (APIs). Also, the coverage in this book is not exhaustive; the goal is to provide a
basic idea of the issues, not complete technical depth.
A major focus will be on the SPMD (Single Program, Multiple Data) model. In SPMD, all
threads or processes run the same program, but they may perform different tasks depending
on their thread or process ID. This is usually done using conditional statements, like:
if (I’m thread/process 0)
    do this;
else
    do that;
This style makes it easy to implement data parallelism, where each process works on a
different part of the data. For example:
if (I’m thread/process 0)
    operate on the first half of the array;
else
    operate on the second half of the array;
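A concrete (and hedged) version of this pattern in C might look like the sketch below; the
names my_rank, work_on, and thread_work are invented for illustration and do not belong to
any particular API:

/* Each thread runs the same function but works on a different half of the
   array, chosen by its rank -- the essence of SPMD data parallelism.      */
void work_on(double a[], int first, int last);   /* hypothetical helper */

void thread_work(int my_rank, double a[], int n) {
    if (my_rank == 0)
        work_on(a, 0, n / 2);        /* first half of the array  */
    else
        work_on(a, n / 2, n);        /* second half of the array */
}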
Also, SPMD programs can support task parallelism by dividing different tasks among
processes or threads. So, even though all threads run the same program, they may be doing
different work on different data or different work altogether, depending on how the code
is structured.
However, most problems need more than just dividing work. We must also:
1. Synchronize threads/processes
2. Enable communication between them
In distributed-memory systems, communication often also ensures synchronization. In
shared-memory systems, threads may synchronize to communicate.
1.2.3 Shared-Memory
In shared-memory programs, variables can be either shared or private.
• Shared variables: accessible by all threads (used for communication).
• Private variables: accessible by only one thread.
Communication between threads is usually implicit through shared variables—there’s no
need for explicit messages.
Nondeterminism
In any MIMD system where processors execute asynchronously, nondeterminism is likely.
A computation is nondeterministic if the same input gives different outputs. This happens
when threads execute independently, and their execution speeds vary from run to run.
Example: Two threads, one with rank 0 and another with rank 1, store:
• my_x = 7 (thread 0)
• my_x = 19 (thread 1)
Both execute:
printf("Thread %d > my_val = %d\n", my_rank, my_x);
Output can be:
Thread 0 > my_val = 7
Thread 1 > my_val = 19
or:
Thread 1 > my_val = 19
Thread 0 > my_val = 7
or even interleaved output. This is fine here since output is labeled, but nondeterminism in
shared-memory programs can lead to serious errors.
Suppose both threads compute my_val and want to update a shared variable x, initially 0:
my_val = Compute_val(my_rank);
x += my_val;
The operation x += my_val is not atomic; it involves:
• Load x from memory
• Add my_val
• Store back to x
If two threads do this simultaneously, one update may overwrite the other. This is a race
condition—result depends on which thread finishes first.
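One possible interleaving that loses an update (a sketch reusing the values 7 and 19 from the
earlier example, with x starting at 0):

Time   Thread 0 (my_val = 7)         Thread 1 (my_val = 19)
0      load x into register (0)
1                                    load x into register (0)
2      add my_val: 0 + 7 = 7
3                                    add my_val: 0 + 19 = 19
4      store 7 into x
5                                    store 19 into x

The final value of x is 19 instead of the expected 26; thread 0's update has been overwritten.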
To prevent this, we use a critical section, where only one thread can execute at a time. This is
ensured using a mutex (mutual exclusion lock):
my_val = Compute_val(my_rank);
Lock(&addmyval_lock);
x += my_val;
Unlock(&addmyval_lock);
A mutex ensures mutual exclusion. When one thread locks the section, others wait. This
avoids incorrect updates but serializes the critical section. Hence, critical sections should be
short and minimal.
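For reference, here is a minimal sketch of the same pattern using POSIX threads (this assumes
a pthreads environment; the lock name mirrors the pseudocode above):

#include <pthread.h>

pthread_mutex_t addmyval_lock = PTHREAD_MUTEX_INITIALIZER;
double x = 0.0;                            /* shared variable */

void add_my_val(double my_val) {
    pthread_mutex_lock(&addmyval_lock);    /* enter the critical section */
    x += my_val;                           /* only one thread at a time  */
    pthread_mutex_unlock(&addmyval_lock);  /* leave the critical section */
}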
Alternatives to mutexes:
• Busy-waiting: a thread waits in a loop:
if (my_rank == 1)
    while (!ok_for_1);   // Busy wait
x += my_val;
if (my_rank == 0)
    ok_for_1 = true;     // Allow thread 1
Simple to use but wastes CPU time.
• Semaphores: similar to mutexes, offer more flexibility in synchronization.
• Monitors: high-level objects; only one thread can call a method at a time.
• Transactional memory: treats critical sections like database transactions. If a thread
can’t complete, changes are rolled back.
Thread safety
In most cases, serial functions can be used in parallel programs without issues. But some
functions, especially those using static local variables, can cause problems.
In C, local variables inside functions are stored on the stack, and since each thread has its own
stack, these are private to each thread. But static variables persist across function calls and
are shared among threads, which can lead to unexpected behavior.
Example: The C function strtok splits a string into substrings using a static char* variable to
remember the position. If two threads call strtok on different strings at the same time, the
shared static variable can be overwritten, causing loss of data or wrong outputs.
Such a function is not thread-safe. In multithreaded programs, using it can lead to errors or
unpredictable results.
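For comparison, here is a small sketch using the reentrant POSIX variant strtok_r, which keeps
its position in a caller-supplied pointer instead of a hidden static variable, so each thread can
safely tokenize its own string (the delimiter and helper name are illustrative):

#include <stdio.h>
#include <string.h>

void print_tokens(char *line) {              /* line must be writable */
    char *saveptr;                           /* per-call, per-thread position */
    char *tok = strtok_r(line, " ", &saveptr);
    while (tok != NULL) {
        printf("%s\n", tok);
        tok = strtok_r(NULL, " ", &saveptr);
    }
}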
A function is not thread-safe if multiple threads access shared data without proper
synchronization. So, although many serial functions are safe to use in parallel programs,
programmers must be careful when using functions originally designed for serial execution.
1.2.4 Distributed-memory
In distributed-memory systems, each core or processor can directly access only its own
private memory. Unlike shared-memory systems, cores do not have access to a single shared
memory space. To facilitate communication between these isolated memories, various APIs
are used, with the most widely used being message-passing. This approach enables data
exchange between processes running on different nodes or cores.
Interestingly, distributed-memory APIs can even be implemented on shared-memory
hardware by logically partitioning shared memory into private address spaces and handling
communication via software tools or compilers.
Unlike shared-memory systems that use multiple threads, distributed-memory programs
are generally implemented using multiple processes. These processes often run on separate
CPUs under independent operating systems, and launching a single process that spawns
threads across distributed systems is generally not feasible.
Message Passing
The message-passing approach revolves around the use of two core functions: Send() and
Receive(). Each process is assigned a rank ranging from 0 to p - 1 (where p is the total number
of processes). Communication between processes typically follows this model:
char message[100];
my_rank = Get_rank();
if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
}
else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
The function Get_rank() returns the process’s unique rank. The behavior of Send() and
Receive() can vary. In a blocking send, the sender waits until the matching Receive() is ready.
In non-blocking send, data is copied into a buffer, and the sender resumes execution
immediately.
Each process runs the same program, but performs different actions based on its rank—a
model known as SPMD (Single Program, Multiple Data). Variables like message[] exist
separately in each process’s private memory space.
Message-passing libraries often include collective communication functions, such as
broadcast, where one process sends data to all others, and reduction, where individual
results are combined (e.g., summing values).
The most widely used message-passing API is MPI (Message Passing Interface), which we
will study in more detail in the next module.
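As a preview of the next module, the same greeting exchange written with MPI might look like
the hedged sketch below (the details of these calls are covered later):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char message[100];
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* rank plays the role of Get_rank() */

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}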