CUDA Tutorial
History of GPUs
• VGA in the early '90s -- a memory controller and display generator connected to some (video) RAM
• By 1997, VGA controllers were incorporating some acceleration functions
• In 2000, a single-chip graphics processor incorporated almost every detail of the traditional high-end workstation graphics pipeline
- Processors oriented to 3D graphics tasks
- Vertex/pixel processing, shading, texture mapping, rasterization
Contemporary PC architecture
Basic unified GPU architecture
[Figure: basic unified GPU architecture, built from streaming multiprocessors and special function units]
ROP = Raster Operations Pipeline
TPC = Texture Processing Cluster
Tutorial CUDA
Cyril Zeller, NVIDIA Developer Technology
Note: These slides are truncated from a longer version which is publicly available on the web.
Enter the GPU
Many threads (threadID = 0, 1, 2, ..., 7, ...) all execute the same kernel code, each on its own data element:

float x = input[threadID];
float y = func(x);
output[threadID] = y;
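Assembled into an actual CUDA kernel, the per-thread code above might look like the following minimal sketch (the kernel name and the body of func are illustrative, not from the slides):

// illustrative stand-in for the per-element computation in the figure
__device__ float func(float x) { return 2.0f * x; }

// each thread computes its own threadID and processes one element
__global__ void applyFunc(const float *input, float *output)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}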
Kernel grid (Block 0 ... Block 7) on a 2-Core Device vs. a 4-Core Device:
• 2-Core Device: blocks run two at a time, in four waves (Blocks 0-1, 2-3, 4-5, 6-7)
• 4-Core Device: blocks run four at a time, in two waves (Blocks 0-3, then 4-7)
Blocks are scheduled onto available cores automatically, so the same kernel grid scales across devices.
cudaMemcpy() copies data between host memory and each device's memory (Device 0 memory, Device 1 memory).
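As a minimal sketch (assuming at least two CUDA devices and a host buffer a_h of nBytes bytes, as in the allocation examples below), the host selects a device with cudaSetDevice() before allocating and copying:

float *d0_a, *d1_a;
cudaSetDevice(0);                                        // operate on Device 0
cudaMalloc((void**)&d0_a, nBytes);
cudaMemcpy(d0_a, a_h, nBytes, cudaMemcpyHostToDevice);   // host -> Device 0 memory
cudaSetDevice(1);                                        // operate on Device 1
cudaMalloc((void**)&d1_a, nBytes);
cudaMemcpy(d1_a, a_h, nBytes, cudaMemcpyHostToDevice);   // host -> Device 1 memory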
Per-block shared memory:
• On-chip, small
• Fast
Per-device global memory:
• Persistent across kernel launches (Kernel 1, Kernel 2, ...)
• Used for kernel I/O
[Figure: hardware model. Host side: CPU, chipset, and DRAM. Device side: a GPU with several multiprocessors, each containing thread processors, a double-precision unit, registers, and shared memory; device DRAM holds local memory and global memory.]
Compilation is two-stage: CUDA source is first compiled to PTX code, a virtual instruction set; the PTX is then translated into target code for a concrete GPU (G80, ...).
GPU Memory Allocation / Release
int n = 1024;
int nbytes = n*sizeof(int);            // allocation size in bytes
int *a_d = 0;                          // device pointer
cudaMalloc( (void**)&a_d, nbytes );    // allocate device memory
cudaMemset( a_d, 0, nbytes );          // initialize it to zero
cudaFree(a_d);                         // release it
float *a_h, *b_h;                      // host pointers
float *a_d, *b_d;                      // device pointers
int nBytes = N*sizeof(float);
a_h = (float *)malloc(nBytes);         // allocate host memory
b_h = (float *)malloc(nBytes);
cudaMalloc((void **) &a_d, nBytes);    // allocate device memory
cudaMalloc((void **) &b_d, nBytes);
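A hedged sketch of how this setup typically continues, with initialization and the kernel launch elided (the device-to-host copy assumes a kernel has written b_d):

// ... initialize a_h on the host ...
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernel(s) that read a_d and write b_d (see Part II) ...
cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);   // device -> host
free(a_h);  free(b_h);                                  // release host memory
cudaFree(a_d);  cudaFree(b_d);                          // release device memory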
Part II - Kernels
Thread Hierarchy
Threads and blocks have IDs, which simplifies memory addressing when operating on multidimensional data.

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid consists of blocks, e.g. Block (1, 1), and each block consists of threads indexed (0, 0), (1, 0), (2, 0), (3, 0), (4, 0), ...]
kernel<<<grid, block>>>(...);    // execution configuration: grid and block dimensions
kernel<<<32, 512>>>(...);        // 32 blocks of 512 threads each
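For multidimensional configurations, the arguments are dim3 values; a minimal sketch (dimensions illustrative):

dim3 grid(16, 16);               // 16 x 16 = 256 blocks in the grid
dim3 block(8, 8);                // 8 x 8 = 64 threads per block
kernel<<<grid, block>>>(...);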
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block
Grid example with blockDim.x = 5:

blockIdx.x                              0            1            2
threadIdx.x                             0 1 2 3 4    0 1 2 3 4    0  1  2  3  4
blockIdx.x*blockDim.x + threadIdx.x     0 1 2 3 4    5 6 7 8 9    10 11 12 13 14
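A minimal kernel using this global index (kernel and buffer names illustrative); launched as fillIndex<<<3, 5>>>(out_d), matching the figure, it writes 0 through 14 into out_d:

__global__ void fillIndex(int *out)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;   // global thread index
    out[idx] = idx;                                  // one element per thread
}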
…
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
__device__
Stored in global memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
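A minimal sketch of a statically declared __device__ variable (names illustrative); the host can initialize it with cudaMemcpyToSymbol():

__device__ float g_scale;                            // resides in global memory

__global__ void scale(float *a)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    a[i] *= g_scale;                                 // readable by all threads
}

// host side:
float s = 2.0f;
cudaMemcpyToSymbol(g_scale, &s, sizeof(float));      // set the device variable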
__shared__
Stored in on-chip shared memory (very low latency)
Specified by execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: thread block
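A minimal sketch of per-block staging in __shared__ memory, here a block-wise array reversal (assumes 256-thread blocks; names illustrative):

__global__ void reverseBlock(const float *in, float *out)
{
    __shared__ float s[256];                  // one element per thread, on-chip
    int t = threadIdx.x;
    int base = blockIdx.x*blockDim.x;
    s[t] = in[base + t];                      // stage this block's data
    __syncthreads();                          // all loads complete before any reads
    out[base + t] = s[blockDim.x - 1 - t];    // read back in reversed order
}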
Unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or local memory