Lecture 2: Introduction to CUDA
Credits
• The material used in this presentation is based on code
available in:
– the Tutorial on CUDA in Dr. Dobb's Journal
– Supercomputing 2008 Education Program
– David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE
498AL Spring 2010, University of Illinois, Urbana-Champaign
– Andrew Bellenir’s code for matrix multiplication
– Igor Majdandzic’s code for Voronoi diagrams
– NVIDIA’s CUDA programming guide
Software Requirements/Tools
• Occupancy calculator
• Visual profiler
What is CUDA?
• CUDA is a scalable parallel programming model
and a software environment for parallel computing
– Minimal extensions to familiar C/C++ environment
– Heterogeneous serial-parallel programming model
• NVIDIA’s TESLA architecture accelerates CUDA
– Expose the computational horsepower of NVIDIA GPUs
– Enable GPU computing
• CUDA also maps well to multicore CPUs
• CUDA Architecture
– Expose general-purpose GPU computing as first-class
capability
– Retain traditional DirectX/OpenGL graphics performance
• CUDA C
– Based on industry-standard C
– A handful of language extensions to allow heterogeneous
programs
– Straightforward APIs to manage devices, memory, etc.
CUDA C Prerequisites
The Basics
A GPU is a specialized computer
• We need to allocate space in the graphics card's memory for the variables.
• The graphics card does not have I/O devices, hence we need to copy the input
data from the memory in the host computer into the memory in the graphics card,
using the variable allocated in the previous step.
• We need to specify code to execute.
• Copy the results back to the memory in the host
computer.
CUDA – C with no shader limitations!
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
[Figure: the host's memory contains array; the GPU card's memory is still empty]
Allocate Memory in the GPU card
[Figure: array in the host's memory; array_d allocated in the GPU card's memory]
Copy content from the host’s memory to the
GPU card memory
array array_d
Host’s Memory GPU Card’s Memory
Supercomputing12
2008
Education Program
Execute code on the GPU
[Figure: the GPU's multiprocessors (MPs) operate on array_d in the GPU card's memory]
Copy results back to the host memory
[Figure: the results in array_d are copied back into array in the host's memory]
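Putting the four steps together, a minimal host-side sketch looks roughly like this (N, someKernel, numBlocks, and threadsPerBlock are illustrative placeholders, not names defined in this lecture):

int array[N];      /* input data in the host's memory          */
int *array_d;      /* its counterpart in the GPU card's memory */

/* 1. allocate space in the GPU card's memory */
cudaMalloc((void**)&array_d, N * sizeof(int));

/* 2. copy the input data from host memory to GPU memory */
cudaMemcpy(array_d, array, N * sizeof(int), cudaMemcpyHostToDevice);

/* 3. execute code on the GPU (kernel and configuration are placeholders) */
someKernel<<<numBlocks, threadsPerBlock>>>(array_d);

/* 4. copy the results back to the host's memory */
cudaMemcpy(array, array_d, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(array_d);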
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing
device
• Data-parallel portions of an application are expressed as device
kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
Extended C
• Type Qualifiers
– global, device, shared, local, constant

    __device__ float filter[N];

    __global__ void convolve (float *image) {
        ...
        image[j] = result;
    }

• Runtime API (to manage devices, memory, etc.)
• Compilation path
– cudacc: EDG C/C++ frontend, Open64 Global Optimizer, OCG for the device code; gcc / cl for the host code

[Figure: the same SPMD kernel body is executed by many threads in parallel, each with its own threadID:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y; ]
Grid Size and Block Size
• Programmers need to specify:
– The grid size: The size and shape of the data that the
program will be working on
– The block size: The block size indicates the sub-area of the
original grid that will be assigned to an MP (a set of stream
processors that share local memory)
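For a one-dimensional problem, these two sizes are passed to the kernel launch as dim3 values; this is exactly what the fillArray() example later in this lecture does:

dim3 dimblock(BLOCK_SIZE);                    /* block size: threads per block */
dim3 dimgrid(arraySize / BLOCK_SIZE);         /* grid size: number of blocks   */
cu_fillArray<<<dimgrid, dimblock>>>(array_d); /* launch with this configuration */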
In the GPU:
[Figure: array elements are assigned, block by block (Block 0, Block 1, …), to the processing elements]
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
[Figure 3.2: An example of CUDA thread organization — the host launches Kernel 1 on Grid 1 of the device, which contains blocks such as Block (0, 0) and Block (1, 0). Courtesy: NVIDIA]
CUDA Memory Model Overview
• Global memory
– Main means of communicating R/W data between host and device
• cudaMalloc()
– Allocates an object in the device's global memory
– Requires the address of a pointer to the allocated object and the size of the allocated object
• cudaFree()
– Frees an object from the device's global memory

    int TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires a destination pointer, a source pointer, the number of bytes, and the type (direction) of transfer:
• Host to Host
• Host to Device
• Device to Host
• Device to Device
• Asynchronous transfer
CUDA Host-Device Data Transfer
(cont.)
• Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost are symbolic constants
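The calls themselves are not shown on this slide; they would look like the following (size here is the 64 * 64 array size in bytes):

int size = 64 * 64 * sizeof(float);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  /* host to device */
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);  /* device to host */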
CUDA Built-in Device Variables
• All __global__ and __device__ functions have
access to these automatically defined variables
– dim3 gridDim;
• Dimensions of the grid in blocks (at most 2D)
– dim3 blockDim;
• Dimensions of the block in threads
– uint3 blockIdx;
• Block index within the grid
– uint3 threadIdx;
• Thread index within the block
A More Complex Example: add()
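The kernel itself is not reproduced on this slide. A minimal version consistent with the main() that follows, which passes device pointers to single integers, is:

__global__ void add( int *a, int *b, int *c ) {
    *c = *a + *b;    /* dereference the device pointers and store the sum */
}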
A More Complex Example: main()
int main( void ) {
int a, b, c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = sizeof( int ); // we need space for an integer
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = 2;
b = 7;
A More Complex Example: main() (cont.)
// copy inputs to device
cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );
// launch add() kernel on the GPU
add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
Parallel Programming in CUDA C
• But wait…GPU computing is about massive
parallelism
• So how do we run code in parallel on the device?
• Solution lies in the parameters between the triple
angle brackets:
add<<< 1, 1 >>>( dev_a, dev_b, dev_c);
add<<< N, 1 >>>( dev_a, dev_b, dev_c);
• Instead of executing add() once, add() is executed N times in parallel
Parallel Programming in CUDA C
• Each block adds a value from a[] and b[], storing the result in c[]:

__global__ void add( int *a, int *b, int *c ) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index the arrays, each block handles different indices
Parallel Programming in CUDA C
• We write this code:
__global__ void add( int *a, int *b, int *c ) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• This is what runs in parallel on the device:
Block 0
– c[0] = a[0] + b[0];
Block 1
– c[1] = a[1] + b[1];
Block 2
– c[2] = a[2] + b[2];
Block 3
– c[3] = a[3] + b[3];
#define N 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for 512 integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);
// launch add() kernel with N parallel blocks
add<<< N, 1 >>>( dev_a, dev_b, dev_c);
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost);
free( a ); free( b ); free( c );
cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);
return 0;
}
Threads
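The threads version of the kernel is not shown here; a sketch of the usual form, indexing with threadIdx.x instead of blockIdx.x and launching a single block of N threads, is:

__global__ void add( int *a, int *b, int *c ) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// launch add() with 1 block of N threads
add<<< 1, N >>>( dev_a, dev_b, dev_c );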
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for 512 integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
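The slide stops after the host allocations; the rest of the threads version follows the same pattern as the blocks version above, except that the launch uses one block of N threads (a sketch):

// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch add() kernel with N parallel threads in one block
add<<< 1, N >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
return 0;
}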
Using Threads And Blocks
Indexing Arrays With Threads And
Blocks
• No longer as simple as just using threadIdx.x or blockIdx.x as indices
• To index array with 1 thread per entry (using 8 threads/block)
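The indexing formula itself is missing from the slide; the standard combination of the two IDs (a sketch; with 8 threads per block, blockDim.x would be 8) is:

__global__ void add( int *a, int *b, int *c ) {
    /* combine the block index and the thread index into a unique global index */
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}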
Parallel Addition (Blocks/Threads):
main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for N integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);
// launch add() kernel with blocks and threads
add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost);
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a,int size);
/* The main program */
int main(int argc,char *argv[])
{
/* Declare the array that will be modified by the GPU */
int a[SIZEOFARRAY];
int i;
/* Initialize the array (a[i] = i) */
for(i=0;i < SIZEOFARRAY;i++) {
a[i]=i;
}
/* Print the initial array */
printf("Initial state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
/* Call the function that will in turn call the function in the GPU that will fill
the array */
fillArray(a,SIZEOFARRAY);
/* Now print the array after calling fillArray */
printf("Final state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
return 0;
}
simple.cu
• simple.cu contains two functions
– fillArray(): A function that will be executed on the host and
which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize) {
/* array_d is the GPU counterpart of the array that exists in host memory */
int *array_d;
cudaError_t result;
/* allocate memory on the GPU card and copy the host array into it */
result = cudaMalloc((void**)&array_d, sizeof(int) * arraySize);
result = cudaMemcpy(array_d, array, sizeof(int) * arraySize, cudaMemcpyHostToDevice);
fillArray (part 2)
/* execution configuration... */
/* Indicate the dimension of the block */
dim3 dimblock(BLOCK_SIZE);
/* Indicate the dimension of the grid in blocks */
dim3 dimgrid(arraySize / BLOCK_SIZE);
/* actual computation: call the kernel, the function that is executed
   by each and every processing element on the GPU card */
cu_fillArray<<<dimgrid, dimblock>>>(array_d);
/* read results back: copy the results from the GPU back to the memory on the host */
result = cudaMemcpy(array, array_d, sizeof(int) * arraySize, cudaMemcpyDeviceToHost);
/* Release the memory on the GPU card */
cudaFree(array_d);
}
simple.cu (cont.)
• The other function in simple.cu is
– cu_fillArray()
• This is the kernel that will be executed in every stream processor in
the GPU
• It is identified as a kernel by the use of the keyword: __global__
• This function uses the built-in variables
– blockIdx.x and
– threadIdx.x
to identify a particular position in the array
cu_fillArray
__global__ void cu_fillArray(int *array_d){
int x;
/* blockIdx.x is a built-in variable in CUDA
that returns the blockId in the x axis
of the block that is executing this block of code
threadIdx.x is another built-in variable in CUDA
that returns the threadId in the x axis
of the thread that is being executed by this
stream processor in this particular block
*/
x=blockIdx.x*BLOCK_SIZE+threadIdx.x;
array_d[x]+=array_d[x];
}
A Simple Running Example
Matrix Multiplication
• A simple matrix multiplication example that illustrates
the basic features of memory and thread management
in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread ID usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity
• Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory
[Figure: WIDTH × WIDTH matrices M, N, and P]
Memory Layout of a Matrix in C
[Figure: a 4 × 4 matrix is stored in row-major order — element Mrow,col sits at offset row * Width + col, so the linear layout is
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3]
// A simple host (CPU) version: P = M * N, all Width × Width
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer
(Host-side Code)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
int size = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
…
1. // Allocate and Load M, N to device memory
cudaMalloc((void**)&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue accumulates the element of Pd computed by this thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute one element of Pd]
Step 5: Kernel Invocation
(Host-side Code)
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
– Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
• Note: WIDTH/TILE_WIDTH can be greater than the max grid size (64K)!
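The invocation code itself is not shown on this slide. For the simple kernel above, which indexes only with threadIdx, a minimal configuration uses a single block of Width × Width threads (a sketch); the multi-block tiled layout described in the bullets would additionally require blockIdx in the kernel's indexing:

// Set up the execution configuration: one block of Width x Width threads
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);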
Some Useful Information on
Tools
Compilation
• Any source file containing CUDA language
extensions must be compiled with NVCC
• NVCC is a compiler driver
– Works by invoking all the necessary tools and
compilers like cudacc, g++, cl, ...
• NVCC outputs:
– C code (host CPU Code)
• Must then be compiled with the rest of the application using another tool
– PTX
• Object code directly
• Or, PTX source, interpreted at runtime
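For example, the simple.c / simple.cu pair from earlier in this lecture can be built with a single nvcc invocation (a sketch; the output name is arbitrary): nvcc forwards the plain C file to the host compiler and compiles the .cu file itself.

nvcc -o simple simple.c simple.cu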