MC OpenMP
September 7, 2009
Getting started
If you are using ness you can copy a tar file containing the code directly to your course account:

[user@ness ]$ cp /home/s05/course00/multicore/openmp.tar .

If you are using your own machine, then you can download the files from: http://www.epcc.ed.ac.uk/etg/multicore

If you are using a Unix/Linux system, or Cygwin, then you should download openmp.tar. Unpack the tar file with the command

tar xvf openmp.tar

If you are using a Windows compiler then you should download openmp.zip and use standard Windows tools to unpack it.
Extra Exercise
Incorporate a call to omp_get_num_threads() into the code and print its value within and outside of the parallel region.
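A minimal C sketch of the idea, assuming a simple hello-world style program (the structure shown here is illustrative, not taken from the supplied code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Outside a parallel region only the master thread exists,
       so omp_get_num_threads() returns 1. */
    printf("Outside parallel region: %d thread(s)\n", omp_get_num_threads());

#pragma omp parallel
    {
        /* Inside the region every thread sees the full team size. */
        printf("Thread %d of %d inside the parallel region\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}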
repeat the iteration (up to a maximum number of iterations, say 100,000). If after the maximum number of iterations the condition is still not satisfied, then add one to the total number of points inside the Set. If the threshold condition is satisfied, i.e. |z| > 2, then stop iterating and move on to the next point.
Parallelisation
Now parallelise the serial code using the OpenMP directives and library routines that you have learned so far. The method for doing this is as follows:

1. Start a parallel region before the main Monte Carlo iteration loop, making sure that any private, shared or reduction variables within the region are correctly declared.

2. Distribute the npoints random points across the n threads available so that each thread has an equal number of the points. For this you will need to use some of the OpenMP library routines. N.B.: there is no ceiling function in Fortran77, so to calculate the number of points on each thread (nlocal) compute (npoints+n-1)/n instead of npoints/n. Then for each thread the iteration bounds are given by:

   ilower = nlocal * myid + 1
   iupper = min(npoints, (myid+1)*nlocal)

   where myid stores the thread identification number. (A sketch of this decomposition is given after this list.)

3. Rewrite the iteration loop to take into account the new distribution of points.

4. End the parallel region after the iteration loop.

Once you have written the code, try it out using 1, 2, 3 and 4 threads. Check that the results are identical in each case, and compare the time taken for the calculations using the different numbers of threads.
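A C-style sketch of the decomposition described in step 2, with the Monte Carlo iteration body omitted; the variable names follow the text above, and ninside is assumed here to be the running count of points inside the Set:

#include <omp.h>

int npoints = 100000;   /* total number of random points (illustrative value) */
int ninside = 0;        /* count of points found to lie inside the Set        */

void area_estimate(void)
{
#pragma omp parallel default(none) shared(npoints) reduction(+:ninside)
    {
        int n      = omp_get_num_threads();
        int myid   = omp_get_thread_num();
        int nlocal = (npoints + n - 1) / n;            /* ceiling of npoints/n  */
        int ilower = nlocal * myid + 1;                /* first point on thread */
        int iupper = (npoints < (myid + 1) * nlocal)   /* last point on thread  */
                     ? npoints : (myid + 1) * nlocal;

        for (int i = ilower; i <= iupper; i++) {
            /* ... iterate point i here and increment ninside if the
                   threshold condition is never met ... */
        }
    }
}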
Extra Exercise
Write out which iterations are being performed by each thread to make sure all the points are being iterated.
Exercise 3: Image
The aim of this exercise is to use the OpenMP worksharing directives, specifically the PARALLEL DO/parallel for directives. This code reconstructs an image from a version which has been passed through a simple edge detection filter. The edge pixel values are constructed from the image using

   edge_{i,j} = image_{i-1,j} + image_{i+1,j} + image_{i,j-1} + image_{i,j+1} - 4 image_{i,j}

If an image pixel has the same value as its four surrounding neighbours (i.e. no edge) then the value of edge_{i,j} will be zero. If the pixel is very different from its four neighbours (i.e. a possible edge) then edge_{i,j} will be large in magnitude. We will always consider i and j to lie in the range 1, 2, . . . , M and 1, 2, . . . , N respectively. Pixels that lie outside this range (e.g. image_{i,0} or image_{M+1,j}) are simply considered to be set to zero. Many more sophisticated methods for edge detection exist, but this is a nice simple approach.

The exercise is actually to do the reverse operation and construct the initial image given the edges. This is a slightly artificial thing to do, and is only possible given the very simple approach used to detect the edges in the first place. However, it turns out that the reverse calculation is iterative, requiring many successive operations, each very similar to the edge detection calculation itself. The fact that calculating the image from the edges requires a large amount of computation makes it a much more suitable program than edge detection itself for the purposes of timing and parallel scaling studies. As an aside, this inverse operation is also very similar to a large number of real scientific HPC calculations that solve partial differential equations using iterative algorithms such as Jacobi or Gauss-Seidel.
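For reference, the edge calculation above corresponds to a double loop of the following form (a serial C sketch; the function and array names, and the zero halo at indices 0 and m+1/n+1, are assumptions based on the description rather than names from the supplied code):

/* Serial sketch of the edge-detection formula. The arrays are sized
   (m+2) x (n+2) so that a halo of zeros handles out-of-range neighbours. */
void edge_detect(int m, int n, double image[m+2][n+2], double edge[m+2][n+2])
{
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            edge[i][j] = image[i-1][j] + image[i+1][j]
                       + image[i][j-1] + image[i][j+1]
                       - 4.0 * image[i][j];
        }
    }
}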
Parallelisation
The code for this exercise can be found in */Image/. The file notlac.dat contains the initial data. The code can be parallelised by adding PARALLEL DO/parallel for directives to the three routines new2old, jacobistep and residue, taking care to correctly identify shared, private and reduction variables. Hint: use default(none). Remember to add the OpenMP compiler flag. To test for correctness, compare the final residue value to that from the sequential code. Run the code on different numbers of threads. Try to answer the following questions: Where is synchronisation taking place? Where is communication taking place?
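The sketch below shows the general pattern, assuming a standard five-point Jacobi update and a sum-of-squares residue; the array and variable names here are illustrative, and the real loop bodies are those in the supplied routines, so treat this only as a guide to where the clauses go:

/* Jacobi step: every pixel of the new image can be updated independently. */
#pragma omp parallel for default(none) shared(m, n, newimg, oldimg, edge)
for (int i = 1; i <= m; i++) {
    for (int j = 1; j <= n; j++) {
        newimg[i][j] = 0.25 * (oldimg[i-1][j] + oldimg[i+1][j]
                             + oldimg[i][j-1] + oldimg[i][j+1]
                             - edge[i][j]);
    }
}

/* Residue: a global sum over all pixels, so it needs a reduction clause. */
#pragma omp parallel for default(none) shared(m, n, newimg, oldimg) reduction(+:sum)
for (int i = 1; i <= m; i++) {
    for (int j = 1; j <= n; j++) {
        double d = newimg[i][j] - oldimg[i][j];
        sum += d * d;
    }
}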
Extra exercise
Redo Exercise 2 using worksharing directives.
The Code
The template for the code can be found in */Goldbach/.

1. Initialise the array numpairs to zero.

2. Create a do/for loop which loops over i and sets each element of the array numpairs to the returned value of a function goldbach(2*i). This array will contain the number of Goldbach pairs for each even number up to 8000.

3. Once the number of Goldbach pairs for each even number has been calculated, test that the Goldbach conjecture holds true for these numbers, that is, that for each even number greater than two there is at least one Goldbach pair.

Your output should match these results:

   Even number   Goldbach pairs
          2            0
        802           16
       1602           53
       2402           37
       3202           40
       4002          106
       4802           64
       5602           64
       6402          156
       7202           78
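A serial C sketch of steps 1 to 3, assuming the goldbach() function from the template returns the number of Goldbach pairs for its (even) argument; the array size and indexing shown here are illustrative choices, not taken from the template:

#include <stdio.h>

#define NEVEN 4000               /* even numbers 2, 4, ..., 8000 */

int goldbach(int even);          /* supplied in the template */

int main(void)
{
    int numpairs[NEVEN] = {0};   /* step 1: initialise to zero */

    /* Step 2: count the Goldbach pairs for each even number 2*i. */
    for (int i = 1; i <= NEVEN; i++) {
        numpairs[i - 1] = goldbach(2 * i);
    }

    /* Step 3: check the conjecture for every even number greater than two. */
    for (int i = 2; i <= NEVEN; i++) {
        if (numpairs[i - 1] < 1) {
            printf("Conjecture fails for %d\n", 2 * i);
        }
    }
    return 0;
}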
Parallelisation
Parallelise the code using OpenMP directives around the do/for loop which calls the goldbach function. Validate that the code is still working when using more than one thread, and then experiment with the schedule clause to try and improve performance.
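For instance, the loop from the sketch above could be parallelised like this; the schedule kind and chunk size shown are only a starting point for experimentation:

/* The work per call to goldbach() typically grows with its argument, so the
   schedule clause can make a noticeable difference; try static, dynamic and
   guided with various chunk sizes. */
#pragma omp parallel for default(none) shared(numpairs) schedule(dynamic, 8)
for (int i = 1; i <= NEVEN; i++) {
    numpairs[i - 1] = goldbach(2 * i);
}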
Extra Exercise
1. Print out which thread is doing each iteration so that you can see directly the effect of the different scheduling clauses.

2. Try running the loop in reverse order, and using the GUIDED schedule.
The Code
The code can be found in */md/. The code is a molecular dynamics (MD) simulation of argon atoms in a box with periodic boundary conditions. The atoms are initially arranged as a face-centred cubic (fcc) lattice and then allowed to melt. The interaction of the particles is calculated using a Lennard-Jones potential. The main loop of the program is in the file main.[f|c|f90]. Once the lattice has been generated and the forces and velocities initialised, the main loop begins. The following steps are undertaken in each iteration of this loop:

1. The particles are moved based on their velocities, and the velocities are partially updated (call to domove).

2. The forces on the particles in their new positions are calculated and the virial and potential energies accumulated (call to forces).

3. The forces are scaled, the velocity update is completed and the kinetic energy calculated (call to mkekin).

4. The average particle velocity is calculated and the temperature scaled (call to velavg).

5. The full potential and virial energies are calculated and printed out (call to prnout).
Parallelisation
The parallelisation of this code is a little less straightforward. There are several dependencies within the program which will require use of the ATOMIC/atomic directive as well as the REDUCTION/reduction clause. The instructions for parallelising the code are as follows:

1. Edit the subroutine/function forces.[f|c|f90]. Start a !$OMP PARALLEL DO / #pragma omp parallel for for the outer loop in this subroutine, identifying any private or reduction variables. Hint: there are 2 reduction variables.

2. Identify the variable within the loop which must be updated atomically and use !$OMP ATOMIC/#pragma omp atomic to ensure this is the case. Hint: you'll need 6 atomic directives (a sketch of the resulting pattern is given after this list).

Once this is done the code should be ready to run in parallel. Compare the output using 2, 3 and 4 threads with the serial output to check that it is working. Try adding the schedule clause with the option static,n to the DO/for directive for different values of n. Does this have any effect on performance?
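The sketch below illustrates where the clauses and directives end up. The loop body is a generic pairwise force update with placeholder values, not the actual Lennard-Jones code, and the variable names (f, epot, vir) and the flat layout of f are illustrative rather than copied from forces.[f|c|f90]:

#pragma omp parallel for default(none) shared(npart, f) reduction(+:epot, vir)
for (int i = 0; i < npart; i++) {
    for (int j = i + 1; j < npart; j++) {
        /* Placeholders: the real code computes the Lennard-Jones pair force
           (fx, fy, fz), its potential energy e and virial contribution v. */
        double fx = 0.0, fy = 0.0, fz = 0.0, e = 0.0, v = 0.0;

        epot += e;                   /* reduction variable */
        vir  += v;                   /* reduction variable */

        /* f is treated here as a flat array of length 3*npart; both
           particles' components may be updated concurrently by different
           threads, hence the six atomic directives. */
#pragma omp atomic
        f[3*i]     += fx;
#pragma omp atomic
        f[3*i + 1] += fy;
#pragma omp atomic
        f[3*i + 2] += fz;
#pragma omp atomic
        f[3*j]     -= fx;
#pragma omp atomic
        f[3*j + 1] -= fy;
#pragma omp atomic
        f[3*j + 2] -= fz;
    }
}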
Orphaning
To reduce the overhead of starting and stopping the threads, you can change the PARALLEL DO/omp parallel for directive to a DO/for directive and start the parallel region outside the main loop of the program. Edit the main.[f|c|f90] file. Start a PARALLEL/parallel region before the main loop. End the region after the end of the loop. Except for the forces routine, all the other work in this loop should be executed by one thread only. Ensure that this is the case using the SINGLE or MASTER directive. Recall that any reduction variables updated in a parallel region but outside of the DO/for directive should be updated by one thread only. As before, check your code is still working correctly. Is there any difference in performance compared with the code without orphaning?
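A structural C sketch of the orphaned version, with the serial work shown as comments rather than the real calls (the routine names are those listed earlier; nsteps stands for however many timesteps the main loop performs):

/* In main.[c]: the parallel region now encloses the whole timestep loop. */
#pragma omp parallel
{
    for (int step = 0; step < nsteps; step++) {
#pragma omp single
        {
            /* domove: move particles, partial velocity update (one thread) */
        }

        /* forces() is called by every thread; the orphaned for/DO directive
           inside it shares out the outer particle loop. */

#pragma omp single
        {
            /* mkekin, velavg, prnout, and any reduction variables updated
               outside the for/DO directive: one thread only */
        }
    }
}

/* In forces.[c]: the directive becomes an orphaned worksharing construct. */
#pragma omp for reduction(+:epot, vir)
for (int i = 0; i < npart; i++) {
    /* ... pairwise force loop as before ... */
}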
Multiple Arrays
The use of the atomic directive is quite expensive. One way round the need for the directive in this molecular dynamics code is to create a new temporary array for the variable f. This array can then be used to collect the updated values of f from each thread, which can then be reduced back into f.

1. Edit the main.[f|c|f90] file. Create an array ftemp of dimensions (npart, 3, 0:MAXPROC-1) for Fortran, or [3*npart][MAXPROC] for C, where MAXPROC is a constant value which the number of threads used will not exceed. This is the array to be used to store the updates on f from each thread.

2. Within the parallel region, create a variable nprocs and assign to it the number of threads being used.

3. The variables ftemp and nprocs and the parameter MAXPROC need to be passed to the forces routine. (N.B. In C it is easier, but rather sloppy, to declare ftemp as a global variable.)

4. In the forces routine define a variable my_id and set it to the thread number. This variable can now be used as the index for the last dimension of the array ftemp. Initialise ftemp to zero.

5. Within the parallelised loop, remove the atomic directives and replace the references to the array f with ftemp.

6. After the loop, the array ftemp must now be reduced into the array f. Loop over the number of particles (npart) and the number of threads (nprocs) and sum the temporary array into f. (A sketch of this reduction is shown after this list.)

The code should now be ready to try again. Check it's still working and see if the performance (especially the scalability) has improved.
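A minimal C sketch of the final reduction in step 6, assuming the layout described above (ftemp[3*npart][MAXPROC]) and that each reference f[k] inside the force loop has already been replaced by ftemp[k][my_id]:

/* Reduce the per-thread copies back into f; each element of f is touched by
   only one iteration of the outer loop, so no atomic directive is needed. */
for (int k = 0; k < 3 * npart; k++) {
    for (int id = 0; id < nprocs; id++) {
        f[k] += ftemp[k][id];
    }
}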
A 2-processor front-end is used for compilation, debugging and testing of parallel jobs, and for submission of jobs to the back-end. One 16-processor and one 8-processor back-end are used for production runs of parallel jobs.
Using a Makefile
The Makefile below is a typical example of the Makefiles used in the exercises. The option -O3 is a standard optimisation flag for speeding up execution of the code.

## Fortran compiler and options
FC= pgf90 -O3 -mp

## Object files
OBJ= main.o \
     sub.o

## Compile
execname: $(OBJ)
	$(FC) -o $@ $(OBJ)

.f.o:
	$(FC) -c $<

## Clean out object files and the executable.
clean:
	rm *.o execname
Job Submission
Batch processing is very important on the HPC service, because it is the only way of accessing the back-end. Interactive access is not allowed. For doing timing runs you must use the back-end. To do this, you should submit a batch job as follows, for example using 4 threads:

qsub -cwd -pe omp 4 scriptfile

where scriptfile is a shell script containing
#! /usr/bin/bash
export OMP_NUM_THREADS=4
./execname

The -cwd flag ensures the output files are written to the current directory. The -pe flag requests 4 processors. Note that with OpenMP codes the number of processors requested should be the same as the environment variable OMP_NUM_THREADS. This can be ensured by using:

export OMP_NUM_THREADS=$NSLOTS

You can monitor your jobs' status with the qstat command, or the qmon utility. Jobs can be deleted with qdel (or via the qmon utility).