CUDA Lab Manual
LAB WORKBOOK
August 2009
Copyright 2009 Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore.
TABLE OF CONTENTS

1 INTRODUCTION
  1.1 General Purpose Graphic Processing Unit (GPGPU)
  1.2 Compute Unified Device Architecture (CUDA)
  1.3 Main Objectives

2 SETTING UP CUDA DEVELOPMENT ENVIRONMENT
  2.1 Verifying That You Have a CUDA-Capable System
  2.2 Downloading CUDA Development Components
  2.3 Installing CUDA Software Components
  2.4 Verifying CUDA Installations
  2.5 General Procedure of Programming in CUDA

3 PROGRAMMING IN CUDA
  3.1 Programming Exercise 1 (Hello World)
  3.2 Programming Exercise 2 (Matrix Multiplication)
  3.3 Programming Exercise 3 (Numerical Calculation of the Value of Pi (π))
  3.4 Programming Exercise 4 (Parallel Sort)
LAB WORKBOOK
This workbook is written to assist the students of the Summer Short Course on Parallel Programming with CUDA at the Al-Khawarizmi Institute of Computer Science (KICS). This edition was prepared over a short period of two months and was finalized in July 2009. The contents of this document have been compiled from various academic resources to expose students to General Purpose Graphic Processing Units (GPGPU) and Nvidia's Compute Unified Device Architecture (CUDA) in a hands-on fashion. For further information, please contact KICS at UET, Lahore:
Telephone: (042) 992 50450
Fax: (042) 992 50246
Email: [email protected]
1 Introduction
Multicore and many-core systems provide within-the-box parallel processing capabilities. Computing tasks that used to run on supercomputers can now run on desktops, provided we understand the capabilities of the available hardware and the software techniques needed to exploit those resources.
1.1 General Purpose Graphic Processing Unit (GPGPU)
The Graphic Processing Unit (GPU) found on commodity video adapters has evolved into a highly parallel, multithreaded, many-core processor, thanks largely to the gaming industry. These GPUs offer enormous computational power as well as very high memory bandwidth, both of which can be exploited by general-purpose high-performance applications. Such programmable GPUs are also known as general purpose graphic processing units (GPGPU; from now on we will simply use the term GPU). The GPU is specialized for compute-intensive, highly parallel computation, exactly the kind of work required by graphics rendering. It follows a SIMD architectural model and is programmed using a data-parallel programming model.
1.3 Main Objectives
At the end of this lab, you should be able to:
- Set up a CUDA development environment
- Write, compile and run CUDA programs on an Nvidia device as well as on x86 multicore systems in device emulation mode
- Use the data-parallel programming model
2 Setting Up CUDA Development Environment
2.1 Verifying That You Have a CUDA-Capable System
To use CUDA on your system, you will need a supported version of Linux with the gcc compiler and toolchain, the CUDA software (available free of charge at http://www.nvidia.com/cuda) and a CUDA-capable GPU. If you do not have a CUDA-capable GPU, you can still use CUDA in device emulation mode. Device emulation mode exists mainly for debugging and obviously does not offer the performance of a CUDA-capable GPU, so it should not be used for release builds or performance tuning. After installing the CUDA software, we will test the CUDA build environment by compiling and running one or more sample programs (available in the CUDA SDK). This validates that the hardware and software are installed and communicating correctly.
2.1.1 Verifying a CUDA-Capable GPU
Note: Skip this section if your system is not equipped with a CUDA-capable Nvidia GPU. You can verify that the system has an Nvidia GPU with the lspci command:
[root@gm gm]# lspci |grep -i nVidia
01:00.0 VGA compatible controller: nVidia Corporation GeForce 9600M GT (rev a1)
[root@gm gm]#
If you do not see any output, either you do not have an Nvidia graphics adapter, or you need to update the PCI hardware database that Linux maintains by running the following command. If your network connection is working, the output should look like the listing below.
[root@gm gm]# update-pciids
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  148k  100  148k    0     0  6241k      0 --:--:-- --:--:-- --:--:-- 6767k
2.1.2 Verifying the Linux Distribution
The current version (2.2) of the CUDA software components requires an x86-based Linux distribution. The following command reports the architecture and release of the running system:
[root@gm gm]# uname -i && cat /etc/*release
i386
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
[root@gm gm]#
The output shows that the running system is 32-bit (i386) Fedora 10. On a 64-bit system running in 64-bit mode, the typical output is x86_64. Version 2.2 of the CUDA development tools supports only the following distributions:
- Red Hat Enterprise Linux 4.3-4.7, 5.0-5.3
- SUSE Enterprise Desktop 10-SP2
- OpenSUSE 11.0 or 11.1
- Fedora 9 or 10
- Ubuntu 8.04 or 8.10
You should check the CUDA download page regularly, because support for other distributions is promised later.
2.1.3 Verifying gcc
The current CUDA development tools support gcc versions 3.4 and 4.x. You can check the version of the currently installed gcc by issuing the following command:
[root@gm gm]# gcc --version
gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@gm gm]#
2.3 Installing CUDA Software Components
2.3.1 Installing the NVIDIA Driver
You need to shut down the X server before installing the driver (the easiest way is to change id:5:initdefault: to id:3:initdefault: in the /etc/inittab file and reboot). You will then get a console only (no graphics). Secondly, you must have the source code of the running kernel (if needed); it can be installed by issuing the following command:
[root@gm gm]# yum install kernel-devel
More information about driver installation is available at http://us.download.nvidia.com/XFree86/Linux-x86/1.0-9755/README/index.html. To install the driver, first exit the GUI (Ctrl-Alt-Backspace). On the command line, issue the following commands as superuser to turn off X, install the driver, and restart the GUI environment, respectively:
[root@gm gm]# su
password:
[root@gm gm]# /sbin/init 3
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./NVIDIA-Linux-x86-185.18.14-pkg1.run
[root@gm gm]# /sbin/init 5
You can also issue the following command to start the GUI environment,
[root@gm gm]# startx
Make sure your internet connection is working, and follow the instructions displayed on your screen.
Note: You can verify the driver release by running the following command:
[root@gm gm]# /usr/bin/nvidia-settings
2.3.2
2.3.3 Setting Environment Variables
You can make these settings permanent by adding the export commands to your ~/.bashrc file, for example as sketched below.
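The exact lines depend on your install prefix; assuming the default /usr/local/cuda location (which matches the PATH and LD_LIBRARY_PATH values shown in section 2.4.1), the entries would look like this:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH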
2.3.4
2.3.5 Installing the CUDA SDK
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudasdk_2.21_linux.run
2.3.6 Installing CUDA-GDB
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudagdb_2.2_linux_32_rhel5.3.run
2.4 Verifying CUDA Installations
2.4.1 Verifying Environment Variables
The environment listing (output of the env command) should show /usr/local/cuda/bin/ on PATH and /usr/local/cuda/lib/ on LD_LIBRARY_PATH. The relevant entries are shown below (the remaining variables are omitted):
LD_LIBRARY_PATH=/usr/local/cuda/lib/:
PATH=/usr/local/cuda/bin/:/usr/kerberos/sbin:/usr/lib/qt3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/gm/bin
(... remaining environment variables omitted ...)
2.4.2 Verifying nvcc
nvcc is the compiler driver for CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for CUDA code. To verify the installation, enter the following commands:
[root@gm gm]# which nvcc
/usr/local/cuda/bin/nvcc
[root@gm ~]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Apr__9_07:37:20_PDT_2009
Cuda compilation tools, release 2.2, V0.2.1221
[root@gm ~]#
2.4.3
2.4.4
2.4.5 Running the SDK Sample Programs
On SELinux-enabled systems, you may need to temporarily disable this security feature using the setenforce command:
[root@gm gm]# setenforce 0
First run the deviceQuery sample, which lists the properties of the detected device:
[root@gm gm]# cd <NVIDIA_CUDA_SDK>/bin/linux/release
[root@gm release]# ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9600M GT"
  CUDA Capability Major revision number:          1
  CUDA Capability Minor revision number:          1
  Total amount of global memory:                  536150016 bytes
  Number of multiprocessors:                      4
  Number of cores:                                32
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  8192
  Warp size:                                      32
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           262144 bytes
  Texture alignment:                              256 bytes
  Clock rate:                                     1.25 GHz
  Concurrent copy and execution:                  Yes
  Run time limit on kernels:                      Yes
  Integrated:                                     No
  Support host page-locked memory mapping:        No
  Compute mode:                                   Default (multiple host threads can use this device simultaneously)
Test PASSED
Press ENTER to exit...
To test that the system and the CUDA-capable device communicate correctly, run the bandwidthTest sample:
[root@gm release]# ./bandwidthTest
Running on......
      device 0:GeForce 9600M GT
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1756.6

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1168.8

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               10762.2

&&&& Test PASSED
You can now start using CUDA to build your own high-performance applications. The NVIDIA CUDA Programming Guide, located in /usr/local/cuda/doc/, is your next step in this course.
2.5 General Procedure of Programming in CUDA
You can use any text editor to write the CUDA source code for your program. Save it with a .cu extension, then issue the following commands (assuming the environment variables are properly set, as described above):
[root@gm <dir>]# nvcc -o <executable_name> -deviceemu <program_name>.cu
[root@gm <dir>]# ./<executable_name>
Replace the contents in < > with actual names. The -deviceemu option compiles the code to run in device emulation mode, i.e. on the CPU only.
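For example, assuming a source file named hello.cu (a hypothetical name used only for illustration), the two steps would be:
[root@gm <dir>]# nvcc -o hello -deviceemu hello.cu
[root@gm <dir>]# ./hello
Dropping the -deviceemu flag compiles the program for execution on the GPU instead of in emulation mode.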
3 Programming in CUDA
CUDA comes with a software environment that allows developers to use C as a high-level programming language. This section consists of programming exercises for hands-on practice. Partitioning a problem into threads and thread blocks, and organizing the thread blocks into one or more grids, is the main challenge faced by CUDA programmers, and the following exercises are designed to build an understanding of this problem orchestration (a minimal sketch follows this paragraph). Complicated details of CUDA, such as the compilation steps, the generated files and file formats, and the precise, efficient use of the memory hierarchy, are out of the scope of this activity; you will learn these concepts gradually. The most important thing is to tackle the problem orchestration and to get output from your simple programs.
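As a minimal sketch of this orchestration (not taken from the exercises below; the kernel name and sizes are made up for illustration), a kernel is launched over a grid of blocks and each thread derives the data element it owns from its block and thread indices:

// Hypothetical kernel: each thread processes one element of an array of size n.
__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)                                      // guard against extra threads
        data[i] = 2.0f * data[i];
}

// Host side: choose a block size, then launch enough blocks to cover all n elements.
// dim3 dimBlock(256);
// dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
// process<<<dimGrid, dimBlock>>>(d_data, n);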
3.1 Programming Exercise 1 (Hello World)
3.1.1 Lab Objectives
Objectives of this lab experiment include:
1. Learning about the general structure of a CUDA program
2. Learning the concepts of kernel, kernel invocation, and hierarchical thread grouping
3. Learning the concepts of threadIdx, blockIdx and blockDim
4. Compiling and running CUDA code in device emulation mode
3.1.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: every thread computes its global ID from its block and thread
// indices and prints a greeting. printf inside a kernel is possible here
// because the program is compiled with -deviceemu and runs on the CPU.
__global__ void printhello()
{
    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Thread%d: Hello World!\n", thid);
}

int main()
{
    // Launch a grid of 5 blocks with 10 threads each (50 threads in total).
    printhello<<<5,10>>>();
    return 0;
}
3.1.3 Procedure
Write this simple program in any text editor and save it with a .cu extension (if a soft copy is not available). Compile and run it as described in section 2.5. Experiment with the kernel invocation statement by changing the values of dimGrid and dimBlock, where the general form of the invocation is kernel<<<dimGrid, dimBlock>>>(). Try to figure out how the ID of a thread changes as dimBlock and dimGrid change; a small worked example follows.
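As a worked example (the numbers are simply the ones used in the listing above), with printhello<<<5,10>>>() the launch creates 5 blocks of 10 threads, so blockDim.x is 10 and the global IDs run from 0 to 49:

// blockIdx.x = 2, threadIdx.x = 3  ->  thid = 2 * 10 + 3 = 23
// blockIdx.x = 4, threadIdx.x = 9  ->  thid = 4 * 10 + 9 = 49 (largest ID)

Changing the launch to <<<10,5>>> still creates 50 threads, but blockDim.x becomes 5, so the same (blockIdx.x, threadIdx.x) pair maps to a different global ID.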
3.1.4 Conclusions
3.1.5 Instructor's Remarks
Lab instructor's remark on whether the student finished the work to meet the lab objectives.
3.2 Programming Exercise 2 (Matrix Multiplication)
3.2.1 Lab Objectives
Objectives of this lab experiment include:
5. Learning the application of CUDA to linear algebra problems
6. Learning how to partition a large problem into subproblems
7. Learning how to exploit the thread and block IDs for useful calculations
8. Learning how to download the parallel portion of code to the device
9. Learning how to use device memory
10. Understanding heterogeneous programming
3.2.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
/*
 * File: matrix_mul.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 * Code is adapted from Nvidia CUDA Programming Guide ver 2.2.1
 * Matrices are stored in row-major order: M(row, col) = M.elements[row*M.c + col]
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SZ 2
#define DBG 1
//Order of Matrix X = (Xr x Xc)
#define Xc (2 * BLOCK_SZ)
#define Xr (3 * BLOCK_SZ)
//Order of Matrix Y = (Yr x Yc)
#define Yc (2 * BLOCK_SZ)  /* value not shown in the original listing; any multiple of BLOCK_SZ works */
#define Yr Xc              /* must equal Xc for the product X x Y to be defined */
//Order of Matrix Z = (Zr x Zc)
#define Zc Yc
#define Zr Xr
#define N (Zr*Zc)

typedef struct Matrix{
    int r,c;
    float* elements;
} matrix;

void populate_matrix(matrix*);
void print_matrix(matrix);

// Kernel: each thread computes one element of C = A x B.
__global__ void matrix_mul_krnl(matrix A, matrix B, matrix C)
{
    float C_entry = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < A.c; i++)
        C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
    C.elements[row * C.c + col] = C_entry;
}

int main()
{
    matrix X, Y, Z;
    X.r = Xr; X.c = Xc;
    Y.r = Yr; Y.c = Yc;
    Z.r = Zr; Z.c = Zc;   // Z must be sized before use (missing in the original listing)

    if(DBG)
        printf("C(%d,%d) = A(%d,%d) x B(%d,%d)\n----------------------\n",
               Z.r,Z.c, X.r,X.c, Y.r,Y.c);

    size_t size_Z = Z.c * Z.r * sizeof(float);
    Z.elements = (float*) malloc(size_Z);

    populate_matrix(&X);
    populate_matrix(&Y);
    printf("Matrix A (%d,%d)\n",X.r,X.c);
    print_matrix(X);
    printf("Matrix B(%d,%d)\n",Y.r,Y.c);
    print_matrix(Y);

    // Copy X to device memory as d_A
    matrix d_A;
    d_A.c = X.c; d_A.r = X.r;
    size_t size_A = X.c * X.r * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size_A);
    cudaMemcpy(d_A.elements, X.elements, size_A, cudaMemcpyHostToDevice);

    // Copy Y to device memory as d_B
    matrix d_B;
    d_B.c = Y.c;
    d_B.r = Y.r;
    size_t size_B = Y.c * Y.r * sizeof(float);
    cudaMalloc((void**)&d_B.elements, size_B);
    cudaMemcpy(d_B.elements, Y.elements, size_B, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    matrix d_C;
    d_C.c = Z.c; d_C.r = Z.r;
    size_t size_C = Z.c * Z.r * sizeof(float);
    cudaMalloc((void**)&d_C.elements, size_C);

    // One thread per element of C: blocks of BLOCK_SZ x BLOCK_SZ threads,
    // and enough blocks to cover the whole result matrix.
    dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
    dim3 dimGrid(Y.c / dimBlock.x, X.r / dimBlock.y);
    matrix_mul_krnl<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(Z.elements, d_C.elements, size_C, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);

    printf("Matrix C(%d,%d)\n",Z.r,Z.c);
    print_matrix(Z);

    free(X.elements);
    free(Y.elements);
    free(Z.elements);
    return 0;
}

void populate_matrix(matrix* mat)
{
    int dim = mat->c * mat->r;
    size_t sz = dim * sizeof(float);
    mat->elements = (float*) malloc(sz);
    int i;
    for (i = 0; i < dim; i++)
        mat->elements[i] = (float)(rand()%1000);
}

void print_matrix(matrix mat)
{
    int i, n = 0, dim;
    dim = mat.c * mat.r;
    for (i = 0; i < dim; i++)
    {
        if (i == mat.c * n)
        {
            printf("\n");
            n++;
        }
        printf("%0.2f\t", mat.elements[i]);
    }
printf("\n============================================================\n"); }
3.2.3 Procedure
Write this program in any text editor and save it with a .cu extension (if a soft copy is not available). Compile and run it as described in section 2.5. Experiment with matrices of different sizes as well as with different block sizes. Try to understand the concepts of threadIdx, blockDim and blockIdx and how they are used in this context; a worked index-mapping example is sketched below.
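As a sketch of the index mapping (using only the sizes defined in the listing, with BLOCK_SZ = 2 so that X has Xr = 6 rows), each 2x2 thread block owns one 2x2 tile of C:

// Thread (threadIdx.x = 1, threadIdx.y = 1) in block (blockIdx.x = 0, blockIdx.y = 2)
// computes the element of C at
//   row = blockIdx.y * blockDim.y + threadIdx.y = 2 * 2 + 1 = 5
//   col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + 1 = 1
// i.e. C(5,1), and its loop walks row 5 of A against column 1 of B.

Note that this simple kernel assumes the matrix dimensions are exact multiples of BLOCK_SZ; otherwise some threads would index past the ends of the arrays.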
3.2.4 Conclusions
3.2.5 Instructor's Remarks
Lab instructor's remark on whether the student finished the work to meet the lab objectives.
3.3 Programming Exercise 3 (Numerical Calculation of the Value of Pi (π))
3.3.1 Lab Objectives
Objectives of this lab experiment include:
11. Learning the application of CUDA to scientific (numerical) computing
12. Learning how to use thread IDs in situations where the sequence of execution is important
13. Learning how to attack loops for parallelism
3.3.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
/*
 * File: pi.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct PI_data{
    int n;          // total number of intervals
    int PerThrItr;  // intervals handled by each thread
    int nThr;       // total number of threads
} data;
__global__ void calculate_PI(data d, float* s)
{
    float sum, x, w;
    int itr, i, j;

    itr = d.PerThrItr;
    i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    w = 1.0/(float)d.n;                          // width of each of the n intervals
    sum = 0.0;

    if (i < d.nThr)
    {
        // Each thread sums its own contiguous range of intervals.
        for (j = i * itr; j < (i * itr + itr); j++)
        {
            x = w * (j - 0.5);
            sum += (4.0)/(1.0 + x*x);
        }
        s[i] = sum * w;   // partial result for this thread
    }
}
// Host code
int main(int argc, char** argv)
{
    if(argc < 3)
    {
        printf("Usage: ./<progname> #intervals #threads\n");
        exit(1);
    }

    data pi_data;
    float PI = 0;
    pi_data.n = atoi(argv[1]);      // number of intervals
    pi_data.nThr = atoi(argv[2]);   // number of threads
    pi_data.PerThrItr = pi_data.n/pi_data.nThr;

    float *d_sum;
    float *h_sum;

    // Allocate the per-thread partial sums in device memory
    size_t size = pi_data.nThr * sizeof(float);
    cudaMalloc((void**)&d_sum, size);

    // Memory allocation on host
    h_sum = (float*) malloc(size);
    // cudaMemcpy(d_sum, h_sum, size, cudaMemcpyHostToDevice);

    int threads_per_block = 4;
    int blocks_per_grid;
    blocks_per_grid = (pi_data.nThr + threads_per_block - 1)/threads_per_block;

    calculate_PI<<<blocks_per_grid, threads_per_block>>>(pi_data, d_sum);

    // Copy the partial sums back and accumulate them on the host
    cudaMemcpy(h_sum, d_sum, size, cudaMemcpyDeviceToHost);
    int i;
    for (i = 0; i < pi_data.nThr; i++)
        PI += h_sum[i];
    //PI = PI * pi_data.n;

    printf("Using %d intervals, the value of PI is %f \n", pi_data.n, PI);

    // Free device memory
    cudaFree(d_sum);
    free(h_sum);
    return 0;
}
3.3.3 Procedure
The value of π can be computed numerically from the integral

    π = ∫₀¹ 4/(1 + x²) dx ≈ (1/N) · Σ_{i=0}^{N−1} 4 / (1 + ((i − 0.5)/N)²)

where the interval [0, 1] is divided into N strips of width 1/N; the kernel evaluates exactly this sum, with each thread accumulating its own range of i.
Using this technique, each partial sum can be calculated in parallel. Write this program in any text editor and save it with a .cu extension (if a soft copy is not available). Compile and run it as described in section 2.5. Experiment with different numbers of intervals and threads. Try to understand how threadIdx, blockDim and blockIdx are exploited here to keep the sequence of the workflow; a worked decomposition example follows.
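As a worked example of the decomposition (the executable name and command-line values are chosen only for illustration), running

[root@gm gm]# ./pi 1000000 64

sets n = 1000000 and nThr = 64, so PerThrItr = 15625 and thread i (its global ID) sums the intervals j = i*15625 ... i*15625 + 15624. With threads_per_block fixed at 4 in the listing, the launch uses 16 blocks of 4 threads, and the host adds the 64 partial sums returned in h_sum to obtain the value of π.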
3.3.4 Conclusions
3.3.5 Instructor's Remarks
Lab instructor's remark on whether the student finished the work to meet the lab objectives.
3.4 Programming Exercise 4 (Parallel Sort)
3.4.1 Lab Objectives
Objectives of this lab experiment include:
14. Learning the bitonic sorting algorithm
15. Learning how to use the __shared__ construct
16. Learning how to use the __device__ construct
17. Using barrier synchronization for thread coordination to support parallelism
3.4.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
/*
 * File: parallel_sort.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 * Code is adapted from Nvidia CUDA SDK sample projects ver 2.2.1
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Number of elements to sort. The value below is an assumed example (the
// original listing's value is not shown); it must be a power of two and no
// larger than the maximum number of threads per block (512 on this device).
#define NUM 16

// Exchange two values held in shared memory.
__device__ inline void swap(int & a, int & b)
{
    int tmp = a;
    a = b;
    b = tmp;
}

__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();

    // Parallel bitonic sort
    for (unsigned int k = 2; k <= NUM; k *= 2)
    {
        // Bitonic merge:
        for (unsigned int j = k / 2; j > 0; j /= 2)
        {
            unsigned int ixj = tid ^ j;
            if (ixj > tid)
            {
                // Ascending or descending comparator depending on the k-bit of tid.
                if ((tid & k) == 0)
                {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
                else
                {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }

    // Write result.
    values[tid] = shared[tid];
}

int main(int argc, char** argv)
{
    int values[NUM];
    printf("\nUnsorted Array\n==============\n");
    for(int i = 0; i < NUM; i++)
    {
        values[i] = rand()%1000;
        printf("%d\t", values[i]);
    }
    printf("\n");

    int * dvalues;
    cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

    // One block of NUM threads, with NUM ints of dynamically sized shared memory.
    bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);
    // check for any errors

    cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);

    bool passed = true;
    int i;
    printf("\nSorted Array\n==============\n");
    for(i = 1; i < NUM; i++)
    {
        if (values[i-1] > values[i])
            passed = false;
        printf("%d\t", values[i-1]);
    }
    printf("%d\n", values[NUM-1]);
    printf("\nTest %s\n", passed ? "PASSED" : "FAILED");
    return 0;
}
3.4.3 Procedure
Write this program in any text editor and save it with a .cu extension (if a soft copy is not available). Compile and run it as shown below. Experiment with different values of NUM and check the status of the test (the last line of the output). Try to understand the concepts of threadIdx, blockDim and blockIdx and how they are used in this context. To compile and run:
[root@gm gm]# nvcc -o ll_sort -deviceemu parallel_sort.cu
[root@gm gm]# ./ll_sort
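When experimenting with NUM, note two constraints that follow from the way the kernel is written (this is an observation about the listing above, not an extra requirement stated elsewhere): the bitonic network only works when NUM is a power of two, and because the whole array is sorted by a single block of NUM threads sharing NUM ints of shared memory, NUM cannot exceed the maximum number of threads per block reported by deviceQuery (512 here). For example, NUM values of 16, 64, 128, 256 and 512 all satisfy both constraints.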
3.4.4 Conclusions
3.4.5 Instructor's Remarks
Lab instructor's remark on whether the student finished the work to meet the lab objectives.