Performance
Scalability
Informal definition: a program is scalable if a more powerful system yields more speedup
Formal definition:
We fix p and N, and find E
We increase p
If there is a new N so that E remains the same, the program is scalable
Performance
Scalability
We know: E = S / p, and S = Tserial/Tparallel, and Tparallel = Tserial / p + Toverhead
E = Tserial / (pTparallel) = Tserial / (Tserial + p Toverhead)
Tserial is a function of input size N
E = Tserial(N) / [Tserial(N) + p Toverhead]
Let’s consider Bubble Sort, for which Tserial(N) ≈ N²
E = N² / [N² + p Toverhead]
Performance
Scalability
Let’s increase p by a factor of k and N by a factor of x
Let’s assume Toverhead changes by a factor of m
Enew = (xN)² / [(xN)² + (kp)(mToverhead)] = x²N² / [x²N² + (km) p Toverhead]
If x² = km, then Enew is equal to E, and the program is scalable
If E remains the same without increasing N, the program is strongly scalable
If E remains the same by increasing N at the same rate as p, the program is weakly scalable
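A quick worked example (assuming, purely for illustration, that Toverhead is unchanged, i.e. m = 1): quadruple the core count, so k = 4. The condition x² = km = 4 gives x = 2, so doubling N restores the original efficiency:
Enew = (2N)² / [(2N)² + (4p) Toverhead] = 4N² / [4N² + 4p Toverhead] = N² / [N² + p Toverhead] = E
Since N has to grow with p to hold E constant, Bubble Sort under this model is weakly, not strongly, scalable.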
Timing Parallel Programs
To know the behavior during development (is there a bottleneck?)
To evaluate performance after development
We want to time the data-processing part of the code, not the I/O
We are generally not interested in CPU time
It includes library and system-call time as well
It does not include idle time (which could be a problem)
Timing Parallel Programs
APIs provide functions to time the code
MPI_Wtime, omp_get_wtime are two examples
Both return wall clock time and not the CPU time
Timer resolution is an important parameter
Linux provides timers with nanosecond resolution
Need to check resolution before using
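A minimal sketch of how this check might look on Linux (the choice of CLOCK_MONOTONIC and the surrounding program are illustrative assumptions):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec res, t0, t1;

    /* Ask the OS how fine-grained this clock actually is. */
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("timer resolution: %ld s %ld ns\n", (long) res.tv_sec, res.tv_nsec);

    /* Wall-clock (not CPU) timing of a code region. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code to be timed ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %e s\n", elapsed);
    return 0;
}

On older glibc versions this may need -lrt at link time.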
Timing Parallel Programs
In parallel programs, the code is run by multiple threads/processes
We want to time all threads/processes
In distributed-memory programs, nodes have independent clocks
Every run gives a different set of values (expected)
Report the minimum value
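With MPI_Wtime, one common pattern (a hedged sketch; do_work and the run count of 5 are placeholders) is to synchronize the processes, let the slowest process define each run's time, and report the best run:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the code being timed. */
static void do_work(void) {
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += i * 0.5;
}

int main(int argc, char *argv[]) {
    int rank;
    double best = 1e30;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int run = 0; run < 5; run++) {
        MPI_Barrier(MPI_COMM_WORLD);            /* start everyone together */
        double start = MPI_Wtime();
        do_work();
        double local = MPI_Wtime() - start;

        double slowest = 0.0;                   /* max over all processes  */
        MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0 && slowest < best) best = slowest;
    }
    if (rank == 0) printf("best run: %e s\n", best);

    MPI_Finalize();
    return 0;
}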
Parallel Program Design
Foster’s Methodology
Partitioning (divide data or tasks)
Communication (what type is needed among tasks)
Aggregation or agglomeration (combine tasks into composite tasks)
Mapping (assign tasks to processors)
Parallel Program Design
Foster’s Methodology (an example)
We want histogram of a large array of floats
First focus on the serial solution (very simple and straightforward)
Input to the program:
1. Number of elements in the array
2. Array of floats
3. Minimum value
4. Maximum value
5. Number of bins
Output of the program is an array holding the number of items in each bin
Parallel Program Design
Foster’s Methodology (an example)
If the data items are 1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9, the histogram looks like the bar chart on the slide (figure not reproduced here)
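For reference: assuming min_meas = 0, max_meas = 5, and bin_count = 5 (these parameter values are my assumption, not given above), the counts work out to 6, 3, 2, 3, and 6 items in the five bins.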
Parallel Program Design
Foster’s Methodology (an example)
Using the min_meas and max_meas values, find bin_width
bin_width = (max_meas – min_meas) / bin_count;
Initialize an array bin_maxes to hold upper limits (floats)
for (b = 0 ; b < bin_count ; ++b)
bin_maxes[b] = min_meas + bin_width * (b + 1);
Initialize an array bin_counts for the item count in each bin
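Putting these setup steps together, a minimal sketch (the function name Setup_bins and the malloc/calloc allocation are my assumptions; the variable names follow the slides):

#include <stdlib.h>

float  bin_width;
float *bin_maxes;
int   *bin_counts;

/* Compute bin_width, fill bin_maxes with each bin's upper limit,
   and create bin_counts with every count starting at zero. */
void Setup_bins(float min_meas, float max_meas, int bin_count) {
    bin_width  = (max_meas - min_meas) / bin_count;
    bin_maxes  = malloc(bin_count * sizeof(float));
    bin_counts = calloc(bin_count, sizeof(int));
    for (int b = 0; b < bin_count; b++)
        bin_maxes[b] = min_meas + bin_width * (b + 1);
}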
Parallel Program Design
Foster’s Methodology (an example)
Bin number b will hold all floats n that satisfy the inequality
bin_maxes[b - 1] <= n < bin_maxes[b]
If b = 0 then the inequality is
min_meas <= n < bin_maxes[0]
Find_bin is a function that tells which bin a data item belongs to
For a small number of bins, a linear search is good enough
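A possible Find_bin using that linear search (the signature matches the call on the next slide; the body is my sketch):

/* Return the bin b whose range contains data, i.e. the smallest b
   with data < bin_maxes[b]; anything at or beyond the last upper
   limit is put in the last bin. */
int Find_bin(float data, float bin_maxes[], int bin_count,
             float min_meas) {
    (void) min_meas;                 /* kept only to match the call */
    for (int b = 0; b < bin_count - 1; b++)
        if (data < bin_maxes[b])
            return b;
    return bin_count - 1;
}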
Parallel Program Design
Foster’s Methodology (an example)
for (i = 0; i < data_count; i++) {
    bin = Find_bin(data[i], bin_maxes, bin_count, min_meas);
    bin_counts[bin]++;
}
Parallel Program Design
Foster’s Methodology (an example)
Step 1: Two types of tasks (finding bin and updating count)
Step 2: Communication is through the variable bin
Step 3: Finding the bin and incrementing the count can be aggregated (they run sequentially)
Parallel Program Design
Foster’s Methodology (an example)
Step 4: Okay Houston . . . We have a problem here!
Two threads might try to execute bin_counts[b]++ at once
If bin_count or the thread count is not too big, make bin_counts local to each thread
With 1000 bins and 500 threads, merging the local counts only adds 500,000 integers
Parallel Program Design
Foster’s Methodology (an example)
We have added another task (increment local bin counts)
This is necessary to avoid a race condition and inaccurate bin counts
We will write the code for this problem soon using POSIX threads
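As a preview, a sketch under my own naming and partitioning assumptions (Histogram_thread, the loc_bin_cts table, and the block split of data are not necessarily what the posted code will use): each thread fills its own row of local counts, and the main thread adds them up after joining.

#include <pthread.h>

/* Shared globals (sketch). */
float *data, *bin_maxes, min_meas;
int    data_count, bin_count, thread_count;
int  **loc_bin_cts;   /* loc_bin_cts[t][b]: thread t's count for bin b */
int   *bin_counts;    /* final histogram, filled after the joins       */

int Find_bin(float data, float bin_maxes[], int bin_count, float min_meas);

/* Each thread histograms its own block of data into its own row of
   loc_bin_cts, so no two threads ever write the same location. */
void *Histogram_thread(void *rank) {
    long my_rank  = (long) rank;
    int  my_first = my_rank * data_count / thread_count;
    int  my_last  = (my_rank + 1) * data_count / thread_count;

    for (int i = my_first; i < my_last; i++)
        loc_bin_cts[my_rank][Find_bin(data[i], bin_maxes,
                                      bin_count, min_meas)]++;
    return NULL;
}

/* After main has joined all the threads, it merges the local counts:
   bin_count * thread_count additions in total. */
void Merge_counts(void) {
    for (int b = 0; b < bin_count; b++)
        for (int t = 0; t < thread_count; t++)
            bin_counts[b] += loc_bin_cts[t][b];
}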
Shared Memory Parallel Programming with POSIX Threads
POSIX Threads aka Pthreads is a widely available library
New features are added frequently
Sometimes called light-weight processes (not Solaris LWP)
We create a Pthread and give it a function to execute
If a Pthread completes its task, it can terminate itself
A Pthread can receive info from its parent
It can also return info to its parent
Shared Memory Parallel Programming with POSIX Threads
Creating a Pthread:
int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);
The man page of this function contains a wealth of information
Shared Memory Parallel Programming with POSIX Threads
A thread terminates by calling
void pthread_exit(void *retval);
retval points to data that is made available to a thread that joins this one
A thread can wait for another thread to terminate by calling
int pthread_join(pthread_t thread, void **retval);
If retval is not NULL, the value passed to pthread_exit is copied into the location that retval points to
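A minimal example tying the three calls together (the square function and the value 7 are only illustrative; compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* The child computes a value and hands a pointer to it back to the
   parent via pthread_exit; the parent picks it up with pthread_join. */
void *square(void *arg) {
    long n = (long) arg;
    long *result = malloc(sizeof(long));
    *result = n * n;
    pthread_exit(result);            /* same effect as: return result; */
}

int main(void) {
    pthread_t tid;
    void *retval;

    pthread_create(&tid, NULL, square, (void *) 7L);
    pthread_join(tid, &retval);      /* retval now holds the pointer
                                        passed to pthread_exit          */
    printf("child returned %ld\n", *(long *) retval);
    free(retval);
    return 0;
}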
Shared Memory Parallel Programming with POSIX Threads
This is how the joining and joined threads can communicate
Not joining joinable threads creates zombie threads
Too many zombies can cause pthread_create to fail
Shared Memory Parallel Programming with POSIX Threads
Demo program
Shared Memory: Matrix-vector multiplication
A fairly large number of disciplines need it
Let A have m rows and n columns and x have n rows
The product A*x is a vector with m rows (components)
The i-th component of A*x is the dot product of the i-th row of A and x
y[i] = a[i][0]*x[0] + a[i][1]*x[1] + ... + a[i][n-1]*x[n-1]
Shared Memory: Matrix-vector multiplication
There are n products, which are added to get one component of y
A total of mn products (n² for a square matrix) are computed to get the final answer
Easy to parallelize
If A has m rows and there are p cores, give m/p rows to each core
Of course, if ((m % p) != 0), some cores get more rows than others
Shared Memory: Matrix-vector multiplication
Where to store A, x and y
For now make them global (see Exercise 4.2 for some issues)
Neither A nor x is modified by any thread
Each thread computes its assigned components of y
Multi-threaded matrix-vector multiplication code will be posted
We will measure execution time for various matrix sizes
We will also use this program to study cache effects
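Ahead of the posted code, a rough sketch of the thread function (the name Mat_vect_mult, the flat row-major storage of A, and the block-row split are my assumptions; the posted version may use a 2D array):

#include <pthread.h>

/* Globals, as discussed above (sketch). */
double *A, *x, *y;          /* A stored row-major: A[i*n + j] */
int m, n, thread_count;

/* Each thread computes a contiguous block of rows of y = A*x.
   A and x are only read and the threads write disjoint parts of y,
   so no synchronization is needed; the integer-division split also
   hands the m % p leftover rows to some of the threads. */
void *Mat_vect_mult(void *rank) {
    long my_rank      = (long) rank;
    int  my_first_row = my_rank * m / thread_count;
    int  my_last_row  = (my_rank + 1) * m / thread_count;

    for (int i = my_first_row; i < my_last_row; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
    return NULL;
}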