CUDA Programming Basic: High Performance Computing Center Hanoi University of Science & Technology
2012
Outline
CUDA Installation
Kernel launches
Fermi
The CUDA C language
An extension of C/C++
Data-parallel programming
Thousands of threads execute in
parallel on GPUs
The cost of synchronization is low
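As a sketch of the data-parallel model (the kernel name `vecAdd` and the device pointers are illustrative, not from the slides), each thread computes exactly one output element:

```cuda
// Hypothetical example: one thread per element of the output array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last partial block
        c[i] = a[i] + b[i];
}

// Launch with enough 256-thread blocks to cover n elements, e.g.:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```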
CUDA Installation
http://developer.nvidia.com/category/zone/cuda-zone
Compilation
Any source file containing CUDA language
extensions must be compiled with NVCC
NVCC outputs:
C code (host CPU code)
PTX (device code)
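A minimal sketch of this workflow (the file name `hello.cu` and kernel name are illustrative): NVCC separates the device code from the host code, compiles the device code to PTX, and hands the host code to the host compiler.

```cuda
// hello.cu -- a hypothetical source file mixing host and device code.
#include <cstdio>

__global__ void kernel() { }        // device code: compiled by NVCC to PTX

int main()
{
    kernel<<<1, 1>>>();             // host code: compiled by the host C/C++ compiler
    cudaDeviceSynchronize();        // wait for the kernel to finish
    printf("done\n");
    return 0;
}

// Compile with: nvcc hello.cu -o hello
```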
CUDA kernel and thread
Parallel portions of an application are executed on
the device as kernels
One kernel is executed at a time
Many threads execute each kernel
Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching
CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few
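For instance, a single launch can create thousands of these lightweight threads (the kernel name and sizes below are illustrative):

```cuda
// 64 blocks x 256 threads = 16,384 threads created by one launch.
dim3 grid(64);
dim3 block(256);
myKernel<<<grid, block>>>();   // myKernel is a hypothetical __global__ function
```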
Kernel Memory Access
Registers
Per-thread, on-chip, fastest
Global Memory
Kernel input and output data reside here
Off-chip, large
Uncached
Shared Memory
Shared among threads in a single block
On-chip, small
As fast as registers
The host can read & write global
memory but not shared memory
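A sketch of how these three spaces appear inside a kernel (the kernel and array names are illustrative):

```cuda
__global__ void staged(const float *in, float *out)  // in/out point to global memory
{
    __shared__ float tile[256];          // shared memory: visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                     // x lives in a register
    tile[threadIdx.x] = x;               // stage the value through shared memory
    __syncthreads();                     // wait until the whole block has written
    out[i] = tile[threadIdx.x];          // write the result back to global memory
}
```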
Heterogeneous Programming
int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );  // allocate device (global) memory
cudaMemset( d_a, 0, nbytes );        // zero it from the host
cudaFree( d_a );                     // release it
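Extending the snippet above, a typical round trip also copies data between host and device (the host buffer `h_a` is an assumption, not from the slides):

```cuda
int *h_a = (int *)malloc(nbytes);    // host buffer (nbytes as allocated above)
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read and write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host
free(h_a);
```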
CUDA Installation
Kernel Launches
Hands-On
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
uint3 blockIdx;
Block index within the grid
uint3 threadIdx;
Thread index within the block
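These built-ins combine into a unique global index per thread; a minimal sketch (the kernel name is illustrative):

```cuda
__global__ void indexDemo(int *out)
{
    // Unique index across the whole 1D grid:
    // each block contributes blockDim.x threads before this one.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = idx;
}
```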
Using shared memory
__shared__ int shared_data[256];   // one element per thread in the block
int idx = blockIdx.x * blockDim.x + threadIdx.x;
shared_data[blockDim.x - (threadIdx.x + 1)] = a[idx];
__syncthreads();
a[idx] = shared_data[threadIdx.x];
Part 1 (of 1): All you have to do is implement the body of the
kernel “reverseArrayBlock()”
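A hedged sketch of a host-side launch for this exercise (the block size, element count `dim`, and pointer name `d_a` are assumptions; the slides show only the kernel body):

```cuda
int numThreads = 256;
int numBlocks = dim / numThreads;   // dim: total element count, assumed divisible

// The optional third launch parameter sizes dynamically allocated shared
// memory per block; it is only needed if the kernel declares
// "extern __shared__ int shared_data[];" instead of a fixed-size array.
reverseArrayBlock<<<numBlocks, numThreads, numThreads * sizeof(int)>>>(d_a);
```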
THANK YOU