410A Week 5

The document discusses performance and scalability in parallel programming, defining scalability in terms of efficiency (E) and overhead. It outlines Foster's Methodology for parallel program design, emphasizing task partitioning, communication, and aggregation. Additionally, it covers shared memory parallel programming using POSIX threads, including thread creation, termination, and communication, with a focus on matrix-vector multiplication as a practical example.


Performance


Scalability

Informal definition: a more powerful system yields more speedup

Formal definition:

We fix p and N, and find E

We increase p

If there is a new N such that E remains the same, the program is scalable
Performance


Scalability

We know: E = S / p, S = Tserial / Tparallel, and Tparallel = Tserial / p + Toverhead

E = Tserial / (p · Tparallel) = Tserial / (Tserial + p · Toverhead)

Tserial is a function of the input size N

E = Tserial(N) / [Tserial(N) + p · Toverhead]

Let's consider Bubble Sort, for which Tserial(N) ≈ N²

E = N² / [N² + p · Toverhead]
Performance

Scalability

Let's increase p by a factor of k and N by a factor of x

Let's assume Toverhead changes by a factor of m

Enew = (xN)² / [(xN)² + (kp)(m·Toverhead)] = x²N² / [x²N² + (km) · p · Toverhead]

If x² = km, then Enew equals E and the program is scalable

If E remains the same without increasing N, the program is strongly scalable

If E remains the same by increasing N at the same rate as p, the program is weakly scalable
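
A quick worked example (the numbers are illustrative, not from the slides): suppose p doubles (k = 2) and Toverhead stays the same (m = 1). Then we need x² = km = 2, so increasing N by a factor of x = √2 ≈ 1.41 keeps E unchanged and the program is scalable in the formal sense. It is not strongly scalable under this cost model, since with N fixed (x = 1) the new efficiency N² / (N² + 2p·Toverhead) is smaller than E.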
Timing Parallel Programs

To understand behavior during development (is there a bottleneck?)

To evaluate performance after development

We want to time the data-processing part of the code, not the I/O

Generally not interested in CPU time

It includes library and system-call time as well

It doesn't include idle time (which could be a problem)
Timing Parallel Programs

APIs provide functions to time the code

MPI_Wtime, omp_get_wtime are two examples

Both return wall clock time and not the CPU time

Timer resolution is an important parameter

Linux provides timers with nanosecond resolution

Need to check the resolution before using a timer (a sketch follows)
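
A minimal sketch of the usual wall-clock timing pattern, using OpenMP's timer (the MPI version with MPI_Wtime/MPI_Wtick is analogous). The loop is only a placeholder for real work; compile with gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Check the timer's resolution before trusting short measurements */
    printf("timer resolution: %e seconds\n", omp_get_wtick());

    double start = omp_get_wtime();        /* wall-clock time, not CPU time */

    double sum = 0.0;                      /* placeholder for the data-processing code */
    for (long i = 1; i <= 100000000L; i++)
        sum += 1.0 / (double)i;

    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %f seconds (sum = %f)\n", elapsed, sum);
    return 0;
}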
Timing Parallel Programs

In parallel programs the code is run by multiple threads/processes

We want to time all threads/processes

In distributed-memory programs the nodes have independent clocks

Every run gives a different set of values (this is expected)

Report the minimum value
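
One common way to implement this in MPI (a sketch, not code from the slides): synchronize, time the region on every process, and reduce with MPI_MAX so the slowest process defines the run's time; the minimum of this value over several runs is what gets reported.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together */
    double start = MPI_Wtime();

    /* placeholder for the parallel work being timed */

    double local = MPI_Wtime() - start;   /* this process's elapsed time */

    double global;                        /* slowest process determines the run's time */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("elapsed: %f seconds\n", global);

    MPI_Finalize();
    return 0;
}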
Parallel Program Design

Foster’s Methodology

Partitioning (divide data or tasks)

Communication (what type is needed among tasks)

Aggregation or agglomeration (combine tasks into composite tasks)

Mapping (assign tasks to processors)
Parallel Program Design

Foster’s Methodology (an example)

We want histogram of a large array of floats

First focus on a serial solution (very simple and straightforward)

Input to the program:

1. Number of elements in the array
2. Array of floats
3. Minimum value
4. Maximum value
5. Number of bins

Output of the program is an array holding the number of items in each bin
Parallel Program Design

Foster’s Methodology (an example)

If the data items are 1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9, the histogram looks like the figure on the slide [histogram figure omitted]
Parallel Program Design

Foster’s Methodology (an example)

Using the min_meas and max_meas values, find bin_width

bin_width = (max_meas - min_meas) / bin_count;

Initialize an array bin_maxes to hold the upper limits (floats)

for (b = 0; b < bin_count; ++b)
    bin_maxes[b] = min_meas + bin_width * (b + 1);

Initialize an array bin_counts for the item count in each bin
Parallel Program Design

Foster’s Methodology (an example)

Bin b will hold every float n that satisfies the inequality

bin_maxes[b - 1] <= n < bin_maxes[b]

If b = 0, the inequality is

min_meas <= n < bin_maxes[0]

Find_bin is a function that tells which bin a data item belongs to

For a small number of bins a linear search is good enough; a sketch follows
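
A minimal linear-search sketch of Find_bin (the signature is taken from the call on the next slide; min_meas is unused by the linear search but kept to match):

int Find_bin(float val, const float bin_maxes[], int bin_count,
             float min_meas) {
    /* Returns b such that val < bin_maxes[b], i.e. val falls in
       [bin_maxes[b-1], bin_maxes[b]), with min_meas the lower limit
       of bin 0. Assumes min_meas <= val. */
    for (int b = 0; b < bin_count; b++)
        if (val < bin_maxes[b])
            return b;
    return bin_count - 1;   /* guard for val == max_meas rounding */
}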
Parallel Program Design

Foster’s Methodology (an example)

for (i = 0; i < data_count; i++) {
    bin = Find_bin(data[i], bin_maxes, bin_count, min_meas);
    bin_counts[bin]++;
}
Parallel Program Design

Foster’s Methodology (an example)

Step 1: Two types of tasks (finding bin and updating count)

Step 2: Communication is through the variable bin

Step 3: Finding bin and ++count can be aggregated (sequential)
Parallel Program Design

Foster’s Methodology (an example)

Step 4: Okay Houston . . . We have a problem here!

Two threads might try to execute ++bin_counts[b] at once

If bin_count and the thread count are not too big, give each thread its own local copy of bin_counts

With 1000 bins and 500 threads, merging the local copies adds only 500,000 integers
Parallel Program Design

Foster’s Methodology (an example)

We have added another task (incrementing the local bin counts)

Necessary to avoid a race condition and inaccurate bin counts

We will write the code for this problem soon using POSIX threads; a sketch of the idea appears below
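
The course code may differ; this is only a minimal sketch of the local-counts idea (all names are my own). Each thread histograms its slice of the data into a private array, then merges into the shared bin_counts under a mutex. Compile with gcc -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define BIN_COUNT 1000
#define THREADS   4

/* Shared data: globals for brevity, as in the course examples */
float *data;  long data_count;
float bin_maxes[BIN_COUNT];  float min_meas;
int bin_counts[BIN_COUNT];
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same linear search as the earlier sketch */
int Find_bin(float val, const float maxes[], int n, float lo) {
    for (int b = 0; b < n; b++)
        if (val < maxes[b]) return b;
    return n - 1;
}

void *Hist_thread(void *arg) {
    long rank = (long)arg;
    long chunk = data_count / THREADS;
    long first = rank * chunk;
    long last  = (rank == THREADS - 1) ? data_count : first + chunk;

    /* Private counts: no synchronization needed while counting */
    int *local = calloc(BIN_COUNT, sizeof(int));
    for (long i = first; i < last; i++)
        local[Find_bin(data[i], bin_maxes, BIN_COUNT, min_meas)]++;

    /* Merge once, under the lock, to avoid the race on bin_counts */
    pthread_mutex_lock(&mutex);
    for (int b = 0; b < BIN_COUNT; b++)
        bin_counts[b] += local[b];
    pthread_mutex_unlock(&mutex);

    free(local);
    return NULL;
}

int main(void) {
    /* ... read data, data_count, min_meas and fill bin_maxes as on
       the earlier slides (elided here) ... */
    pthread_t tids[THREADS];
    for (long r = 0; r < THREADS; r++)
        pthread_create(&tids[r], NULL, Hist_thread, (void *)r);
    for (long r = 0; r < THREADS; r++)
        pthread_join(tids[r], NULL);
    return 0;
}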
Shared Memory Parallel Programming with POSIX Threads

POSIX Threads aka Pthreads is a widely available library

New features are added frequently

Sometimes called light-weight processes (not Solaris LWP)

We create a Pthread and give it a function to execute

If a Pthread completes its task, it can terminate itself

A Pthread can receive info from its parent

It can also return info to its parent

Shared Memory Parallel Programming with POSIX Threads


Creating a Pthread:

int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);

The man page of this function contains a wealth of information
Shared Memory Parallel Programming with POSIX Threads

A thread terminates by calling

void pthread_exit(void *retval);

retval points to data that is made available to a thread that joins the terminated thread

A thread can wait for another thread to terminate by calling

int pthread_join(pthread_t thread, void **retval);

If retval is not NULL, the location it points to receives the pointer that was passed to pthread_exit

Shared Memory Parallel Programming with POSIX Threads

This is how joining and joined threads can communicate

Not joining joinable threads creates zombie threads

Too many zombies can cause pthread_create to fail
Shared Memory Parallel Programming with POSIX Threads
Demo program
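
The demo program itself is not reproduced in the slides; below is a minimal stand-in (my own naming) showing creation, termination with pthread_exit, and retrieval of the return value with pthread_join. Compile with gcc -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Each thread squares its rank and returns the result to the parent */
void *Square(void *arg) {
    long rank = (long)arg;
    long *result = malloc(sizeof(long));   /* must outlive the thread */
    *result = rank * rank;
    pthread_exit(result);                  /* retrieved via pthread_join */
}

int main(void) {
    enum { NTHREADS = 4 };
    pthread_t tids[NTHREADS];

    for (long r = 0; r < NTHREADS; r++)
        pthread_create(&tids[r], NULL, Square, (void *)r);

    for (long r = 0; r < NTHREADS; r++) {
        void *ret;
        pthread_join(tids[r], &ret);       /* also prevents zombie threads */
        printf("thread %ld returned %ld\n", r, *(long *)ret);
        free(ret);
    }
    return 0;
}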

Shared Memory: Matrix-vector multiplication


A fairly large number of disciplines need it

Let A have m rows and n columns and x have n rows

The product A*x is a vector with m rows (components)

The i-th component of A*x is the dot product of the i-th row of A and x

y[i] = a[i][0]*x[0] + a[i][1]*x[1] + ... + a[i][n-1]*x[n-1]
Shared Memory: Matrix-vector multiplication


There are n products, which are added to get one component of y

In total m·n products are computed to obtain the final answer (n² when A is square)

Easy to parallelize

If A has m rows and there are p cores, give m/p rows to each core

Of course, if ((m % p) != 0), some cores get more rows

Shared Memory: Matrix-vector multiplication


Where do we store A, x and y?

For now make them global (see Exercise 4.2 for some issues)

Neither A nor x is modified by any thread

Each thread computes its assigned components of y

Multi-threaded matrix-vector multiplication code will be posted; a rough sketch appears below

We will measure execution time for various matrix sizes

We will also use this program to study cache effects
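
The posted code may differ; this is only a minimal sketch of the block-row decomposition with global A, x, and y, where the last thread absorbs the leftover rows when m % thread_count != 0 (all names and sizes are assumptions). Compile with gcc -pthread.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Globals, as discussed above: shared, with A and x read-only */
int m, n, thread_count;
double *A, *x, *y;                       /* A stored row-major: A[i*n + j] */

void *Mat_vect(void *arg) {
    long rank = (long)arg;
    /* Block-row decomposition; the last thread takes the remainder rows */
    int chunk = m / thread_count;
    int first = rank * chunk;
    int last  = (rank == thread_count - 1) ? m : first + chunk;

    for (int i = first; i < last; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];   /* dot product of row i with x */
    }
    return NULL;
}

int main(void) {
    m = 8; n = 8; thread_count = 4;      /* small sizes for illustration */
    A = malloc(m * n * sizeof(double));
    x = malloc(n * sizeof(double));
    y = malloc(m * sizeof(double));
    for (int i = 0; i < m * n; i++) A[i] = 1.0;
    for (int j = 0; j < n; j++)     x[j] = j;

    pthread_t *tids = malloc(thread_count * sizeof(pthread_t));
    for (long r = 0; r < thread_count; r++)
        pthread_create(&tids[r], NULL, Mat_vect, (void *)r);
    for (long r = 0; r < thread_count; r++)
        pthread_join(tids[r], NULL);

    for (int i = 0; i < m; i++) printf("y[%d] = %f\n", i, y[i]);
    free(A); free(x); free(y); free(tids);
    return 0;
}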
