openmp_HPC_ass1
OpenMP (Part 1)
What is OpenMP
Open specifications for Multi Processing
Long version: Open specifications for MultiProcessing via
collaborative work between interested parties from the hardware
and software industry, government and academia.
• An Application Program Interface (API) that is used to explicitly
direct multi-threaded, shared memory parallelism.
• API components:
– Compiler directives
– Runtime library routines
– Environment variables
• Portability
– API is specified for C/C++ and Fortran
– Implementations on almost all platforms including Unix/Linux and
Windows
• Standardization
– Jointly defined and endorsed by major computer hardware and
software vendors
– Possibility to become ANSI standard
Brief History of OpenMP
Thread
• A process is an instance of a computer program that
is being executed. It contains the program code and
its current activity.
• A thread of execution is the smallest unit of
processing that can be scheduled by an operating
system.
• Differences between threads and processes:
– A thread is contained inside a process. Multiple threads
can exist within the same process and share resources
such as memory. The threads of a process share the
latter’s instructions (code) and its context (values that
its variables reference at any given moment).
– Different processes do not share these resources.
http://en.wikipedia.org/wiki/Process_(computing)
Process
• A process contains all the information needed to execute
the program
– Process ID
– Program code
– Data on run time stack
– Global data
– Data on heap
Each process has its own address space.
• In multitasking, processes are given time slices in a round-robin fashion.
– If computer resources are assigned to another process, the status of the present process has to be saved, so that the execution of the suspended process can be resumed at a later time.
Threads
OpenMP Programming Model
OpenMP is not
– Necessarily implemented identically by all vendors
– Meant for distributed-memory parallel systems (it is designed for shared-address-space machines)
– Guaranteed to make the most efficient use of shared memory
– Required to check for data dependencies, data conflicts, race
conditions, or deadlocks
– Required to check for code sequences that cause a program to be classified as non-conforming
– Meant to cover compiler-generated automatic parallelization
and directives to the compiler to assist such parallelization
– Designed to guarantee that input or output to the same file is
synchronous when executed in parallel.
Fork-Join Parallelism
• An OpenMP program begins as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.
• When a parallel region is encountered, the master thread
– Creates a group of threads (FORK).
– Becomes the master of this group of threads and is assigned thread id 0 within the group.
• The statements in the program that are enclosed by the parallel region construct are then executed in parallel among these threads.
• JOIN: when the threads complete executing the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread (see the sketch below).
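A minimal sketch of the fork-join model (the printed messages are illustrative): only the master thread runs the serial parts, and a team of threads exists only inside the parallel region.
#include <stdio.h>
#include "omp.h"
int main()
{
  printf("Serial part: master thread only\n");
  #pragma omp parallel   // FORK: a team of threads executes this block
  {
    printf("Parallel part: thread %d\n", omp_get_thread_num());
  }                      // JOIN: threads synchronize; only the master continues
  printf("Serial part again: master thread only\n");
  return 0;
}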
OpenMP Code Structure
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
#pragma omp parallel
{
int ID = omp_get_thread_num();
printf("Hello (%d)\n", ID);
printf(" world (%d)\n", ID);
}
}
Set # of threads for OpenMP
In csh:
setenv OMP_NUM_THREADS 8
Run: ./a.out
See: http://wiki.crc.nd.edu/wiki/index.php/OpenMP
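Assuming the GNU compiler is used (any OpenMP-capable compiler works), the example can be built with
gcc -fopenmp hello.c -o a.out
and in bash the equivalent of the csh command above is
export OMP_NUM_THREADS=8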
OpenMP Core Syntax
#include "omp.h"
int main()
{
  int var1, var2, var3;
  // Serial code
  ...
  // Beginning of parallel section.
  // Fork a team of threads. Specify variable scoping.
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    // Parallel section executed by all threads
    ...
  } // All threads join master thread and disband
  // Resume serial code
}
OpenMP C/C++ Directive Format
OpenMP directive forms
– C/C++ use compiler directives
• Prefix: #pragma omp …
– A directive consists of a directive name followed by
clauses
Example: #pragma omp parallel default(shared) private(var1, var2)
OpenMP Directive Format (2)
General Rules:
• Case sensitive
• Only one directive-name may be specified per
directive
• Each directive applies to at most one succeeding
statement, which must be a structured block.
• Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash "\" at the end of a directive line (see the example below).
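A long parallel directive split across lines with backslash continuations (the variable names are illustrative):
#pragma omp parallel default(shared) \
        private(var1, var2) \
        firstprivate(var3)
{
  /* structured block */
}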
OpenMP parallel Region Directive
#pragma omp parallel [clause list]
Typical clauses in [clause list]
• Conditional parallelization
– if (scalar expression)
• Determine whether the parallel construct creates threads
• Degree of concurrency
– num_threads (integer expression)
• Number of threads to create
• Data Scoping
– private (variable list)
• Specifies variables local to each thread
– firstprivate (variable list)
• Similar to private, but each thread's private copy is initialized with the variable's value just before the parallel directive
– shared (variable list)
• Specifies variables that are shared among all the threads
– default (data scoping specifier)
• The default data scoping specifier may be shared or none
Example:
#pragma omp parallel if (is_parallel == 1) num_threads(8) shared (var_b) \
private (var_a) firstprivate (var_c) default (none)
{
/* structured block */
}
• if (is_parallel == 1) num_threads(8)
– If the value of the variable is_parallel is one, create 8 threads
• shared (var_b)
– All threads share a single copy of variable var_b
• private (var_a) firstprivate (var_c)
– Each thread gets private copies of variables var_a and var_c
– Each private copy of var_c is initialized with the value of var_c in the master thread when the parallel directive is encountered
• default (none)
– The default state of a variable is specified as none (rather than shared)
– Signals an error if not all variables are specified as shared or private
Number of Threads
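In brief, the number of threads for a parallel region can be requested in several ways, all of which appear in the examples below: the num_threads clause on the parallel directive takes precedence over the most recent call to omp_set_num_threads(), which in turn takes precedence over the OMP_NUM_THREADS environment variable and the implementation default. The number of threads actually delivered may still be smaller, depending on the implementation and system limits.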
Thread Creation: Parallel Region Example
• Create threads with the parallel construct
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
  int nthreads, tid;
  // num_threads(4): clause to request 4 threads
  #pragma omp parallel num_threads(4) private(tid)
  {
    // each thread executes a copy of the code within the structured block
    tid = omp_get_thread_num();
    printf("Hello world from (%d)\n", tid);
    if(tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("number of threads = %d\n", nthreads);
    }
  } // all threads join the master thread and terminate
}
Thread Creation: Parallel Region Example
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main(){
int nthreads, A[100], tid;
// fork a group of threads with each thread having a private tid variable
omp_set_num_threads(4);
#pragma omp parallel private (tid)
{
  tid = omp_get_thread_num();
  foo(tid, A); // a single copy of A[] is shared between all threads; foo() is defined elsewhere
} // all threads join the master thread and terminate
}
SPMD vs. Work-Sharing
Work-Sharing Construct
Do/for
• Shares the iterations of a loop across the group of threads
• Represents "data parallelism"
• The for directive partitions loop iterations across threads; DO is the analogous directive in Fortran
Usage:
#pragma omp for [clause list]
/* for loop */
• Implicit barrier at the end of the for loop
Example Using for
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
int main()
{
int nthreads, tid;
omp_set_num_threads(3);
#pragma omp parallel private(tid)
{
int i;
tid = omp_get_thread_num();
printf("Hello world from (%d)\n", tid);
#pragma omp for
for(i = 0; i <= 4; i++)
{
  printf("Iteration %d by %d\n", i, tid);
}
} // all threads join the master thread and terminate
}
Another Example Using for
• Sequential code to add two vectors
for(i=0;i<N;i++) {c[i] = b[i] + a[i];}
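A parallel version using the parallel for construct might look like the following sketch (N and the arrays are assumed to be declared and initialized elsewhere; the loop index is made private to each thread automatically):
#pragma omp parallel for shared(a, b, c)
for(i=0;i<N;i++) {c[i] = b[i] + a[i];}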
C/C++ for Directive Syntax
#pragma omp for [clause list]
schedule (type [,chunk])
ordered
private (variable list)
firstprivate (variable list)
shared (variable list)
reduction (operator: variable list)
collapse (n)
nowait
/* for_loop */
Reduction
• Serial code
{
double avg = 0.0, A[MAX];
int i;
…
for(i = 0; i < MAX; i++) {avg += A[i];}
avg /= MAX;
}
Reduction Clause
• Reduction (operator: variable list): specifies how
to combine local copies of a variable in different
threads into a single copy at the master when
threads exit. Variables in variable list are
implicitly private to threads.
– Operators: +, *, -, &, |, ^, &&, and ||
– Usage
#pragma omp parallel reduction(+: sum) num_threads(4)
{
  /* compute a local sum in each thread */
}
/* sum here contains the sum of all local instances of sum */
Reduction in OpenMP for
• Inside a parallel or a work-sharing construct:
– A local copy of each list variable is made and initialized depending on the operator (e.g. 0 for "+")
– The compiler finds standard reduction expressions containing the operator and uses them to update the local copy.
– Local copies are reduced into a single value and combined with the original global value when control returns to the master thread.
{
double avg = 0.0, A[MAX];
int i;
…
#pragma omp parallel for reduction (+:avg)
for(i = 0; i < MAX; i++) {avg += A[i];}
avg /= MAX;
}
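A complete, compilable version of the sketch above (MAX, the initialization values, and the printed check are illustrative assumptions):
#include <stdio.h>
#include "omp.h"
#define MAX 1000
int main()
{
  double avg = 0.0, A[MAX];
  int i;
  for(i = 0; i < MAX; i++) { A[i] = i; }  // illustrative data
  #pragma omp parallel for reduction(+:avg)
  for(i = 0; i < MAX; i++) { avg += A[i]; }
  avg /= MAX;
  printf("average = %f\n", avg);          // expected: (MAX-1)/2 = 499.5
  return 0;
}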
Reduction Operators/Initial-Values
C/C++ (operator, initial value of each private copy):
+   0
*   1
-   0
&   ~0 (all bits set)
|   0
^   0
&&  1
||  0
Monte Carlo to estimate PI
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
pi = 4.0*count/samples;
printf("Estimate of pi: %7.5f\n", pi);
}
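A serial version consistent with the fragment above might look like the following sketch (the names count and samples come from the fragment; everything else is an assumption):
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
  int i, count = 0;                  // points falling inside the quarter circle
  int samples = atoi(argv[1]);       // number of random points
  double x, y, pi;
  srand(0);
  for(i = 0; i < samples; i++)
  {
    x = (double)rand() / RAND_MAX;   // random point in the unit square
    y = (double)rand() / RAND_MAX;
    if(x*x + y*y <= 1.0) count++;
  }
  pi = 4.0*count/samples;
  printf("Estimate of pi: %7.5f\n", pi);
  return 0;
}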
OpenMP version of Monte Carlo to Estimate PI
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"
samples = atoi(argv[1]);
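A possible OpenMP version consistent with the fragment (the overall structure is an assumption): count is combined with a reduction clause, and each thread uses its own seed with the POSIX rand_r() so that random number generation is thread-safe.
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"
int main(int argc, char *argv[])
{
  int i, count = 0;
  int samples = atoi(argv[1]);
  double pi;
  #pragma omp parallel reduction(+:count)
  {
    unsigned int seed = omp_get_thread_num() + 1;  // per-thread seed for rand_r()
    #pragma omp for private(i)
    for(i = 0; i < samples; i++)
    {
      double x = (double)rand_r(&seed) / RAND_MAX;
      double y = (double)rand_r(&seed) / RAND_MAX;
      if(x*x + y*y <= 1.0) count++;
    }
  }
  pi = 4.0*count/samples;
  printf("Estimate of pi: %7.5f\n", pi);
  return 0;
}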
Static Scheduling
// static scheduling of matrix multiplication loops
#pragma omp parallel private (i, j, k) \
shared (a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for(i=0; i < dim; i++)
{
  for(j=0; j < dim; j++)
  {
    c[i][j] = 0.0;
    for(k=0; k < dim; k++)
      c[i][j] += a[i][k]*b[k][j];
  }
}
// A static schedule maps iterations to threads in fixed blocks, determined before the loop executes.
Environment Variables
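Commonly used OpenMP environment variables (a brief, non-exhaustive summary) include:
• OMP_NUM_THREADS – default number of threads for parallel regions (e.g., in csh: setenv OMP_NUM_THREADS 8)
• OMP_SCHEDULE – run-time schedule for loops declared with schedule(runtime) (e.g., setenv OMP_SCHEDULE "static,4")
• OMP_DYNAMIC – allow the runtime to adjust the number of threads dynamically
• OMP_NESTED – enable nested parallelism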
By default, worksharing for loops end with an implicit barrier.
• nowait: if specified, threads do not synchronize at the end of the parallel loop.
• ordered: specifies that the iterations of the loop must be executed as they would be in a serial program (see the sketch after this list).
• collapse: specifies how many loops in a nested loop should be collapsed into one large iteration space and divided according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the iterations in the collapsed iteration space.
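A small runnable sketch of the ordered clause (the loop body is illustrative): the work still runs in parallel, but the ordered region executes in sequential iteration order.
#include <stdio.h>
#include "omp.h"
int main()
{
  int i;
  #pragma omp parallel for ordered
  for(i = 0; i < 8; i++)
  {
    int v = i * i;                        // work done in parallel
    #pragma omp ordered
    printf("iteration %d -> %d\n", i, v); // output appears in order i = 0..7
  }
  return 0;
}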
Avoiding Synchronization with nowait
#pragma omp parallel shared(A,B,C) private(id)
{
  id = omp_get_thread_num();
  A[id] = big_calc1(id);
  #pragma omp barrier  // barrier: each thread waits until all threads arrive
  #pragma omp for
  for(i = 0; i < N; i++) { C[i] = big_calc3(i,A); }  // implicit barrier at the end of this loop
  #pragma omp for nowait
  for(i = 0; i < N; i++) { B[i] = big_calc2(C,i); }  // no implicit barrier due to nowait
  A[id] = big_calc4(id);  // any thread can begin big_calc4() immediately, without waiting
                          // for other threads to finish the preceding loop
}  // implicit barrier at the end of the parallel region
• By default: worksharing for loops end with an
implicit barrier
• nowait clause:
– Modifies a for directive
– Avoids implicit barrier at end of for
Loop Collapse
• Allows parallelization of perfectly nested loops without
using nested parallelism
• Compiler forms a single loop and then parallelizes this
{
…
#pragma omp parallel for collapse (2)
for(i=0;i< N; i++)
{
for(j=0;j< M; j++)
{
foo(A,i,j);
}
}
}
For Directive Restrictions
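Briefly, as a general summary of the restrictions in the OpenMP specification: the loop must be in "canonical" form, so that the number of iterations can be computed before the loop starts; the loop variable must be an integer (or pointer) that is changed by a loop-invariant amount each iteration and is not modified inside the loop body; the loop may not be exited early (no break or goto out of the loop); and program correctness must not depend on which thread executes which iteration.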