MC OpenMP
September 7, 2009
Getting started
If you are using ness you can copy a tar file containing the code directly to your course account:

[user@ness ]$ cp /home/s05/course00/multicore/openmp.tar .

If you are using your own machine, then you can download the files from: http://www.epcc.ed.ac.uk/etg/multicore

If you are using a Unix/Linux system, or Cygwin, then you should download openmp.tar. Unpack the tar file with the command

tar xvf openmp.tar

If you are using a Windows compiler then you should download openmp.zip and use standard Windows tools to unpack it.
Extra Exercise
Incorporate a call to omp_get_num_threads() into the code and print its value within and outside of the parallel region.
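A minimal C sketch of the idea, assuming a simple hello-world style program (the structure shown here is illustrative, not taken from the supplied code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Outside a parallel region only the master thread exists,
       so omp_get_num_threads() returns 1. */
    printf("Outside parallel region: %d thread(s)\n", omp_get_num_threads());

#pragma omp parallel
    {
        /* Inside the region every thread sees the full team size. */
        printf("Thread %d of %d inside the parallel region\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}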
repeat the iteration (up to a maximum number of iterations, say 100,000). If after the maximum number of iterations the condition is still not satisfied, then add one to the total number of points inside the Set. If the threshold condition is satisfied, i.e. |z| > 2, then stop iterating and move on to the next point.
Parallelisation
Now parallelise the serial code using the OpenMP directives and library routines that you have learned so far. The method for doing this is as follows:

1. Start a parallel region before the main Monte Carlo iteration loop, making sure that any private, shared or reduction variables within the region are correctly declared.

2. Distribute the npoints random points across the n threads available so that each thread has an equal number of the points. For this you will need to use some of the OpenMP library routines. N.B.: there is no ceiling function in Fortran77, so to calculate the number of points on each thread (nlocal) compute (npoints+n-1)/n instead of npoints/n. Then for each thread the iteration bounds are given by:

   ilower = nlocal * myid + 1
   iupper = min(npoints, (myid+1)*nlocal)

   where myid stores the thread identification number. (A sketch of this decomposition is given after this list.)

3. Rewrite the iteration loop to take into account the new distribution of points.

4. End the parallel region after the iteration loop.

Once you have written the code, try it out using 1, 2, 3 and 4 threads. Check that the results are identical in each case, and compare the time taken for the calculations using the different numbers of threads.
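A C-style sketch of the decomposition described in step 2, with the Monte Carlo iteration body omitted; the variable names follow the text above, and ninside is assumed here to be the running count of points inside the Set:

#include <omp.h>

int npoints = 100000;   /* total number of random points (illustrative value) */
int ninside = 0;        /* count of points found to lie inside the Set        */

void area_estimate(void)
{
#pragma omp parallel default(none) shared(npoints) reduction(+:ninside)
    {
        int n      = omp_get_num_threads();
        int myid   = omp_get_thread_num();
        int nlocal = (npoints + n - 1) / n;            /* ceiling of npoints/n  */
        int ilower = nlocal * myid + 1;                /* first point on thread */
        int iupper = (npoints < (myid + 1) * nlocal)   /* last point on thread  */
                     ? npoints : (myid + 1) * nlocal;

        for (int i = ilower; i <= iupper; i++) {
            /* ... iterate point i here and increment ninside if the
                   threshold condition is never met ... */
        }
    }
}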
Extra Exercise
Write out which iterations are being performed by each thread to make sure all the points are being iterated.
Exercise 3: Image
The aim of this exercise is to use the OpenMP worksharing directives, specifically the PARALLEL DO/parallel for directives. This code reconstructs an image from a version which has been passed through a simple edge detection filter. The edge pixel values are constructed from the image using

   edge_{i,j} = image_{i-1,j} + image_{i+1,j} + image_{i,j-1} + image_{i,j+1} - 4 image_{i,j}

If an image pixel has the same value as its four surrounding neighbours (i.e. no edge) then the value of edge_{i,j} will be zero. If the pixel is very different from its four neighbours (i.e. a possible edge) then edge_{i,j} will be large in magnitude. We will always consider i and j to lie in the range 1, 2, . . . , M and 1, 2, . . . , N respectively. Pixels that lie outside this range (e.g. image_{i,0} or image_{M+1,j}) are simply considered to be set to zero. Many more sophisticated methods for edge detection exist, but this is a nice simple approach.

The exercise is actually to do the reverse operation and construct the initial image given the edges. This is a slightly artificial thing to do, and is only possible given the very simple approach used to detect the edges in the first place. However, it turns out that the reverse calculation is iterative, requiring many successive operations, each very similar to the edge detection calculation itself. The fact that calculating the image from the edges requires a large amount of computation makes it a much more suitable program than edge detection itself for the purposes of timing and parallel scaling studies. As an aside, this inverse operation is also very similar to a large number of real scientific HPC calculations that solve partial differential equations using iterative algorithms such as Jacobi or Gauss-Seidel.
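For reference, the edge calculation above corresponds to a double loop of the following form (a serial C sketch; the function and array names, and the zero halo at indices 0 and m+1/n+1, are assumptions based on the description rather than names from the supplied code):

/* Serial sketch of the edge-detection formula. The arrays are sized
   (m+2) x (n+2) so that a halo of zeros handles out-of-range neighbours. */
void edge_detect(int m, int n, double image[m+2][n+2], double edge[m+2][n+2])
{
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            edge[i][j] = image[i-1][j] + image[i+1][j]
                       + image[i][j-1] + image[i][j+1]
                       - 4.0 * image[i][j];
        }
    }
}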
Parallelisation
The code for this exercise can be found in */Image/. The file notlac.dat contains the initial data. The code can be parallelised by adding PARALLEL DO/parallel for directives to the three routines new2old, jacobistep and residue, taking care to correctly identify shared, private and reduction variables. Hint: use default(none). Remember to add the OpenMP compiler flag. To test for correctness, compare the final residue value to that from the sequential code. Run the code on different numbers of threads. Try to answer the following questions: Where is synchronisation taking place? Where is communication taking place?
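The sketch below shows the general pattern, assuming a standard five-point Jacobi update and a sum-of-squares residue; the array and variable names here are illustrative, and the real loop bodies are those in the supplied routines, so treat this only as a guide to where the clauses go:

/* Jacobi step: every pixel of the new image can be updated independently. */
#pragma omp parallel for default(none) shared(m, n, newimg, oldimg, edge)
for (int i = 1; i <= m; i++) {
    for (int j = 1; j <= n; j++) {
        newimg[i][j] = 0.25 * (oldimg[i-1][j] + oldimg[i+1][j]
                             + oldimg[i][j-1] + oldimg[i][j+1]
                             - edge[i][j]);
    }
}

/* Residue: a global sum over all pixels, so it needs a reduction clause. */
#pragma omp parallel for default(none) shared(m, n, newimg, oldimg) reduction(+:sum)
for (int i = 1; i <= m; i++) {
    for (int j = 1; j <= n; j++) {
        double d = newimg[i][j] - oldimg[i][j];
        sum += d * d;
    }
}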
Extra exercise
Redo Exercise 2 using worksharing directives.
The Code
The template for the code can be found in */Goldbach/.

1. Initialise the array numpairs to zero.

2. Create a do/for loop which loops over i and sets each element of the array numpairs to the returned value of a function goldbach(2*i). This array will contain the number of Goldbach pairs for each even number up to 8000.

3. Once the number of Goldbach pairs for each even number has been calculated, test that the Goldbach conjecture holds true for these numbers, that is, that for each even number greater than two there is at least one Goldbach pair.

Your output should match these results:

   Even number   Goldbach pairs
          2            0
        802           16
       1602           53
       2402           37
       3202           40
       4002          106
       4802           64
       5602           64
       6402          156
       7202           78
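A serial C sketch of steps 1 to 3, assuming the goldbach() function from the template returns the number of Goldbach pairs for its (even) argument; the array size and indexing shown here are illustrative choices, not taken from the template:

#include <stdio.h>

#define NEVEN 4000               /* even numbers 2, 4, ..., 8000 */

int goldbach(int even);          /* supplied in the template */

int main(void)
{
    int numpairs[NEVEN] = {0};   /* step 1: initialise to zero */

    /* Step 2: count the Goldbach pairs for each even number 2*i. */
    for (int i = 1; i <= NEVEN; i++) {
        numpairs[i - 1] = goldbach(2 * i);
    }

    /* Step 3: check the conjecture for every even number greater than two. */
    for (int i = 2; i <= NEVEN; i++) {
        if (numpairs[i - 1] < 1) {
            printf("Conjecture fails for %d\n", 2 * i);
        }
    }
    return 0;
}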
Parallelisation
Parallelise the code using OpenMP directives around the do/for loop which calls the goldbach function. Validate that the code is still working when using more than one thread, and then experiment with the schedule clause to try and improve performance.
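For instance, the loop from the sketch above could be parallelised like this; the schedule kind and chunk size shown are only a starting point for experimentation:

/* The work per call to goldbach() typically grows with its argument, so the
   schedule clause can make a noticeable difference; try static, dynamic and
   guided with various chunk sizes. */
#pragma omp parallel for default(none) shared(numpairs) schedule(dynamic, 8)
for (int i = 1; i <= NEVEN; i++) {
    numpairs[i - 1] = goldbach(2 * i);
}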
Extra Exercise
1. Print out which thread is doing each iteration so that you can see directly the effect of the different scheduling clauses.

2. Try running the loop in reverse order, and using the GUIDED schedule.
The Code
The code can be found in */md/. The code is a molecular dynamics (MD) simulation of argon atoms in a box with periodic boundary conditions. The atoms are initially arranged as a face-centred cubic (fcc) lattice and then allowed to melt. The interaction of the particles is calculated using a Lennard-Jones potential. The main loop of the program is in the file main.[f|c|f90]. Once the lattice has been generated and the forces and velocities initialised, the main loop begins. The following steps are undertaken in each iteration of this loop:

1. The particles are moved based on their velocities, and the velocities are partially updated (call to domove).

2. The forces on the particles in their new positions are calculated and the virial and potential energies accumulated (call to forces).

3. The forces are scaled, the velocity update is completed and the kinetic energy calculated (call to mkekin).

4. The average particle velocity is calculated and the temperature scaled (call to velavg).

5. The full potential and virial energies are calculated and printed out (call to prnout).
Parallelisation
The parallelisation of this code is a little less straightforward. There are several dependencies within the program which will require use of the ATOMIC/atomic directive as well as the REDUCTION/reduction clause. The instructions for parallelising the code are as follows:

1. Edit the subroutine/function forces.[f|c|f90]. Start a !$OMP PARALLEL DO / #pragma omp parallel for for the outer loop in this subroutine, identifying any private or reduction variables. Hint: there are 2 reduction variables.

2. Identify the variable within the loop which must be updated atomically and use !$OMP ATOMIC/#pragma omp atomic to ensure this is the case. Hint: you'll need 6 atomic directives (a sketch of the resulting pattern is given after this list).

Once this is done the code should be ready to run in parallel. Compare the output using 2, 3 and 4 threads with the serial output to check that it is working. Try adding the schedule clause with the option static,n to the DO/for directive for different values of n. Does this have any effect on performance?
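The sketch below illustrates where the clauses and directives end up. The loop body is a generic pairwise force update with placeholder values, not the actual Lennard-Jones code, and the variable names (f, epot, vir) and the flat layout of f are illustrative rather than copied from forces.[f|c|f90]:

#pragma omp parallel for default(none) shared(npart, f) reduction(+:epot, vir)
for (int i = 0; i < npart; i++) {
    for (int j = i + 1; j < npart; j++) {
        /* Placeholders: the real code computes the Lennard-Jones pair force
           (fx, fy, fz), its potential energy e and virial contribution v. */
        double fx = 0.0, fy = 0.0, fz = 0.0, e = 0.0, v = 0.0;

        epot += e;                   /* reduction variable */
        vir  += v;                   /* reduction variable */

        /* f is treated here as a flat array of length 3*npart; both
           particles' components may be updated concurrently by different
           threads, hence the six atomic directives. */
#pragma omp atomic
        f[3*i]     += fx;
#pragma omp atomic
        f[3*i + 1] += fy;
#pragma omp atomic
        f[3*i + 2] += fz;
#pragma omp atomic
        f[3*j]     -= fx;
#pragma omp atomic
        f[3*j + 1] -= fy;
#pragma omp atomic
        f[3*j + 2] -= fz;
    }
}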
Orphaning
To reduce the overhead of starting and stopping the threads, you can change the PARALLEL DO/omp parallel for directive to a DO/for directive and start the parallel region outside the main loop of the program. Edit the main.[f|c|f90] file. Start a PARALLEL/parallel region before the main loop. End the region after the end of the loop. Except for the forces routine, all the other work in this loop should be executed by one thread only. Ensure that this is the case using the SINGLE or MASTER directive. Recall that any reduction variables updated in a parallel region but outside of the DO/for directive should be updated by one thread only. As before, check your code is still working correctly. Is there any difference in performance compared with the code without orphaning?
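A structural C sketch of the orphaned version, with the serial work shown as comments rather than the real calls (the routine names are those listed earlier; nsteps stands for however many timesteps the main loop performs):

/* In main.[c]: the parallel region now encloses the whole timestep loop. */
#pragma omp parallel
{
    for (int step = 0; step < nsteps; step++) {
#pragma omp single
        {
            /* domove: move particles, partial velocity update (one thread) */
        }

        /* forces() is called by every thread; the orphaned for/DO directive
           inside it shares out the outer particle loop. */

#pragma omp single
        {
            /* mkekin, velavg, prnout, and any reduction variables updated
               outside the for/DO directive: one thread only */
        }
    }
}

/* In forces.[c]: the directive becomes an orphaned worksharing construct. */
#pragma omp for reduction(+:epot, vir)
for (int i = 0; i < npart; i++) {
    /* ... pairwise force loop as before ... */
}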
Multiple Arrays
The use of the atomic directive is quite expensive. One way round the need for the directive in this molecular dynamics code is to create a new temporary array for the variable f. This array can then be used to collect the updated values of f from each thread, which can then be reduced back into f.

1. Edit the main.[f|c|f90] file. Create an array ftemp of dimensions (npart, 3, 0:MAXPROC-1) for Fortran, or [3*npart][MAXPROC] for C, where MAXPROC is a constant value which the number of threads used will not exceed. This is the array to be used to store the updates on f from each thread.

2. Within the parallel region, create a variable nprocs and assign to it the number of threads being used.

3. The variables ftemp and nprocs and the parameter MAXPROC need to be passed to the forces routine. (N.B. In C it is easier, but rather sloppy, to declare ftemp as a global variable.)

4. In the forces routine define a variable my_id and set it to the thread number. This variable can now be used as the index for the last dimension of the array ftemp. Initialise ftemp to zero.

5. Within the parallelised loop, remove the atomic directives and replace the references to the array f with ftemp.

6. After the loop, the array ftemp must now be reduced into the array f. Loop over the number of particles (npart) and the number of threads (nprocs) and sum the temporary array into f. (A sketch of this reduction is shown after this list.)

The code should now be ready to try again. Check it's still working and see if the performance (especially the scalability) has improved.
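A minimal C sketch of the final reduction in step 6, assuming the layout described above (ftemp[3*npart][MAXPROC]) and that each reference f[k] inside the force loop has already been replaced by ftemp[k][my_id]:

/* Reduce the per-thread copies back into f; each element of f is touched by
   only one iteration of the outer loop, so no atomic directive is needed. */
for (int k = 0; k < 3 * npart; k++) {
    for (int id = 0; id < nprocs; id++) {
        f[k] += ftemp[k][id];
    }
}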
A 2-processor front-end is used for compilation, debugging and testing of parallel jobs, and for submission of jobs to the back-end. One 16-processor and one 8-processor back-end are used for production runs of parallel jobs.
Using a Makefile
The Makefile below is a typical example of the Makefiles used in the exercises. The option -O3 is a standard optimisation flag for speeding up execution of the code.

## Fortran compiler and options
FC= pgf90 -O3 -mp

## Object files
OBJ= main.o \
     sub.o

## Compile
execname: $(OBJ)
	$(FC) -o $@ $(OBJ)

.f.o:
	$(FC) -c $<

## Clean out object files and the executable.
clean:
	rm *.o execname
Job Submission
Batch processing is very important on the HPC service, because it is the only way of accessing the back-end. Interactive access is not allowed. For doing timing runs you must use the back-end. To do this, you should submit a batch job as follows, for example using 4 threads:

qsub -cwd -pe omp 4 scriptfile

where scriptfile is a shell script containing
#! /usr/bin/bash
export OMP_NUM_THREADS=4
./execname

The -cwd flag ensures the output files are written to the current directory. The -pe flag requests 4 processors. Note that with OpenMP codes the number of processors requested should be the same as the environment variable OMP_NUM_THREADS. This can be ensured by using:

export OMP_NUM_THREADS=$NSLOTS

You can monitor your jobs' status with the qstat command, or the qmon utility. Jobs can be deleted with qdel (or via the qmon utility).