Lecture 3 & 4
Introduction to High Performance Computing
Design Stages:
• The design of a parallel algorithm can be structured into the following four stages:
1) Decomposition (partitioning).
2) Communication.
3) Agglomeration.
4) Scheduling (mapping).
Design Stages:
1- Decomposition (partitioning):
• Decompose the problem into small tasks that can be
executed concurrently.
• Task: indivisible sequential computation unit.
Design Stages:
2- Communication:
• Determine communication required to coordinate task
execution.
Design Stages:
3- Agglomeration:
• Combine tasks into larger tasks to improve performance or to reduce communication cost.
• Also determine whether it is worthwhile to replicate data
and/or computation.
Design Stages:
4- Scheduling (Mapping):
• Assign each task to a processor in a manner that
minimizes execution time (by minimizing communication
and idling).
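As an illustration (not from the slides), a minimal greedy mapping sketch in Python: assuming independent tasks with known runtimes, each task is assigned to the currently least-loaded processor to reduce idling. The task times and the function name are made up for the example.

```python
import heapq

def greedy_map(task_times, num_procs):
    """Assign independent tasks to processors greedily (hypothetical helper).

    task_times: list of estimated task runtimes.
    Returns the list of task indices per processor and the resulting makespan.
    """
    # Place longest tasks first, while the loads are still balanced.
    order = sorted(range(len(task_times)), key=lambda i: -task_times[i])
    heap = [(0.0, p) for p in range(num_procs)]      # (current load, processor id)
    assignment = [[] for _ in range(num_procs)]
    for i in order:
        load, p = heapq.heappop(heap)                # least-loaded processor
        assignment[p].append(i)
        heapq.heappush(heap, (load + task_times[i], p))
    makespan = max(load for load, _ in heap)
    return assignment, makespan

assignment, makespan = greedy_map([4, 3, 3, 2, 2, 2], num_procs=2)
print(assignment, makespan)   # makespan 8 for this toy instance
```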
Design Stages:
(Figure: overview of the four design stages.)
Decomposition techniques:
1) Domain Decomposition.
2) Functional Decomposition.
3) Recursive Decomposition.
4) Hybrid Decomposition.
Decomposition techniques:
1- Domain Decomposition:
• Decompose the data associated with the problem (e.g., block or cyclic distribution); each parallel task then works on a portion of the data.
Decomposition techniques:
1- Domain Decomposition:
• Use the owner computes rule (the process assigned a
particular data item is responsible for all computation
associated with it).
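A minimal Python sketch (not part of the slides) of the two data distributions and the owner-computes rule; the function names and the toy array are assumptions for illustration only.

```python
def block_partition(n, p):
    """Block distribution: each of the p tasks owns one contiguous chunk."""
    size = (n + p - 1) // p                      # ceiling division
    return [list(range(i * size, min((i + 1) * size, n))) for i in range(p)]

def cyclic_partition(n, p):
    """Cyclic distribution: element i is owned by task i mod p."""
    return [list(range(r, n, p)) for r in range(p)]

# Owner-computes rule: each task updates only the indices it owns.
data = [1.0] * 10
for task_id, owned in enumerate(block_partition(len(data), 3)):
    for i in owned:
        data[i] *= 2.0                           # the owner does the work on its items

print(block_partition(10, 3))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(cyclic_partition(10, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```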
Decomposition techniques:
2- Functional Decomposition:
• Decompose the problem according to the work (computation) that must be done; each task then performs a portion of the overall work.
• Consider data dependences (two memory accesses are
involved in a data dependence if they may refer to the
same memory location and one of the accesses is a
write).
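A small Python sketch (an illustration, not from the slides) of functional decomposition: two hypothetical tasks perform different functions over the same read-only data, so there is no data dependence between them and they may run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def mean(values):
    return sum(values) / len(values)

def spread(values):
    return max(values) - min(values)

values = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

# mean() and spread() only read `values`, so no data dependence exists
# between them and they can execute at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    f_mean = pool.submit(mean, values)
    f_spread = pool.submit(spread, values)
    print(f_mean.result(), f_spread.result())

# If one task wrote to `values` (a write plus any other access to the same
# location), the two tasks would be data-dependent and could not run freely in parallel.
```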
Decomposition techniques:
3- Recursive Decomposition:
• A method for inducing concurrency in problems that
can be solved using the divide-and-conquer strategy.
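A minimal divide-and-conquer sketch in Python (not from the slides): the two halves of a sum are independent subproblems. Threads are used only to show the structure; with CPython's GIL a real implementation would use processes or MPI for actual speedup, and the cutoff value is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def tree_sum(values, cutoff=1024):
    """Divide-and-conquer sum: the two halves are independent subproblems
    that can be computed concurrently, then combined."""
    if len(values) <= cutoff:                       # small enough: solve sequentially
        return sum(values)
    mid = len(values) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(tree_sum, values[:mid], cutoff)
        right = pool.submit(tree_sum, values[mid:], cutoff)
        return left.result() + right.result()       # conquer (combine) step

print(tree_sum(list(range(10_000))))                # 49995000
```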
Decomposition techniques:
4- Hybrid Decomposition:
• A mix of decomposition techniques.
Task Dependency Graph:
• A task-dependency graph is a directed acyclic graph in
which the nodes represent tasks, and the directed edges
indicate the dependencies (communication) amongst
them.
• The task corresponding to a node can be executed when
all tasks connected to this node by incoming edges have
completed.
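A small sketch (not from the slides) of a task-dependency graph stored as a map from each task to the tasks it depends on; a task becomes ready once all of its incoming-edge predecessors have completed. The task names are hypothetical.

```python
# Task-dependency graph as an adjacency map: task -> tasks it depends on (a DAG).
deps = {
    "scan_table": set(),
    "filter_rows": {"scan_table"},
    "project_cols": {"scan_table"},
    "join": {"filter_rows", "project_cols"},
}

done = set()
step = 0
while len(done) < len(deps):
    # A task is ready once all tasks on its incoming edges have completed.
    ready = [t for t, d in deps.items() if t not in done and d <= done]
    print(f"step {step}: run concurrently -> {ready}")
    done.update(ready)
    step += 1
# step 0: run concurrently -> ['scan_table']
# step 1: run concurrently -> ['filter_rows', 'project_cols']
# step 2: run concurrently -> ['join']
```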
Example:
Database query processing:
Decomposition (a): (figure; labels: Data Decomposition, Recursive Decomposition)
Decomposition (b): (figure)
Task dependency graphs: (figure)
Task dependency graphs:
• The maximum number of tasks that can be executed simultaneously in a parallel program at any given time is known as its maximum degree of concurrency.
                       Decomposition (a)   Decomposition (b)
Shortest time          27 time units       34 time units
Number of processors   4                   4
Maximum concurrency    4                   4
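A sketch (not from the slides) of how such numbers can be derived from a task-dependency graph, assuming per-task times are known and communication cost is ignored; the DAG and the times below are made up.

```python
def analyze(deps, times):
    """For a task-dependency DAG, return the critical-path length (the shortest
    possible parallel time with enough processors) and the largest number of
    tasks that become ready together under earliest execution."""
    done, steps = set(), []
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        steps.append(ready)
        done.update(ready)
    finish = {}
    for ready in steps:                     # predecessors always finish in earlier steps
        for t in ready:
            finish[t] = times[t] + max((finish[d] for d in deps[t]), default=0)
    return max(finish.values()), max(len(s) for s in steps)

deps = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
times = {"A": 6, "B": 5, "C": 4, "D": 3}
print(analyze(deps, times))   # (13, 2): critical path 13 time units, max concurrency 2
```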
Scheduling (Mapping): (figure)
Performance Analysis:
• To compare two or more parallel algorithms for the same problem, we analyze their performance.
• Usually the following metrics are used:
Parallel runtime (complexity) (T_p):
• The estimated execution time that elapses between the
algorithm's start and termination.
T_p = T_comp + T_comm
➢ T_comp: computation time.
➢ T_comm: communication time.
Performance Analysis:
Speedup (S):
• The ratio of the time taken to solve a problem on a single processor (T_s) to the time required to solve the same problem on a parallel computer with p identical processing elements (T_p).
S = T_s / T_p
Efficiency:
• A measure of the fraction of time for which a
processing element is usefully employed.
E = S / p
Performance Analysis:
Cost (C):
• The product of parallel runtime and the number of
processing elements used.
C = p * T_p
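A tiny calculator (an illustration, not from the slides) that evaluates the four metrics for given measurements; the input values are hypothetical.

```python
def metrics(t_serial, t_comp, t_comm, p):
    """Return parallel runtime, speedup, efficiency and cost."""
    t_parallel = t_comp + t_comm      # T_p = T_comp + T_comm
    speedup = t_serial / t_parallel   # S = T_s / T_p
    efficiency = speedup / p          # E = S / p
    cost = p * t_parallel             # C = p * T_p
    return t_parallel, speedup, efficiency, cost

# Hypothetical measurements: 100 s serial; 20 s compute + 5 s communication on 8 PEs.
print(metrics(100.0, 20.0, 5.0, 8))   # (25.0, 4.0, 0.5, 200.0)
```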
Example:
Adding n numbers using a logical binary tree of n processing elements.
Solution:
• T_s = O(n)
• T_p = O(log n)
• S = T_s / T_p = O(n) / O(log n) = O(n / log n)
• E = S / p = O(n / log n) / n = O(1 / log n)
• C = p * T_p = n * O(log n) = O(n log n)
• This parallel algorithm is not cost-optimal, since C = O(n log n) grows faster than T_s = O(n).
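A small simulation (not from the slides) that checks the ceil(log2 n) parallel rounds of the tree addition; all pair sums within a round are independent, so each round counts as one parallel step.

```python
import math

def tree_add_steps(values):
    """Simulate binary-tree addition: pair up partial sums each round.
    The number of rounds is ceil(log2 n), matching T_p = O(log n)."""
    steps = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)] \
                 + ([values[-1]] if len(values) % 2 else [])
        steps += 1
    return values[0], steps

n = 1024
total, steps = tree_add_steps(list(range(n)))
print(total, steps, math.ceil(math.log2(n)))   # 523776 10 10
```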