Lecture 2
Introduction to CUDA
Languages and APIs for GPU computing: C/C++, OpenCL, DirectX Compute, Fortran, Java, Python, .Net, …

[Figure: these interfaces layered above the compute platform stack (GPU HW, driver, ISA, …), with ATI's Compute "Solution" shown alongside.]
CUDA - C with no shader limitations
• Integrated host + device application C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code (a minimal sketch follows)
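A minimal sketch of that split (the kernel name scale and its launch shape are illustrative, not from the slides):

    __global__ void scale(float *v, float s)    // device code: one thread per element
    {
        v[threadIdx.x] *= s;
    }

    int main()                                  // host code: serial C
    {
        float *d_v;
        cudaMalloc((void**)&d_v, 256 * sizeof(float));
        scale<<<1, 256>>>(d_v, 2.0f);           // the highly parallel part runs on the device
        cudaFree(d_v);
        return 0;
    }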
[Figure: GPU hardware — arrays of streaming processors (SP) under a thread processor, texture filter (TF) units, L1/L2 caches, and framebuffer (FB) partitions.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, Braga, Portugal, June 14-18, 2010
CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing
[Figure: device block diagram — the host feeds an input assembler; several parallel data caches and texture units all share one global memory.]
CUDA C - extensions

• Declspecs
– global, device, shared, local, constant

    __device__ float filter[N];

    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
    }

• All threads execute the same kernel code (SPMD); each thread is distinguished only by its ID:

    float a = input[threadIdx];
    float b = func(a);
    output[threadIdx] = b;
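The constant declspec, not shown above, would be used like this (a sketch; coeffs is an assumed name):

    __constant__ float coeffs[16];    // read-only table in device constant memory

    // On the host, fill it before launching kernels:
    float h_coeffs[16] = {0};         // fill with real coefficients
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));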
• Block and thread IDs (blockIdx, threadIdx) simplify memory addressing when processing multidimensional data (see the sketch below):
– Image processing
– Solving PDEs on volumes
– …
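A minimal sketch of 2D indexing for image processing (the kernel name invert and its parameters are illustrative, not from the slides):

    __global__ void invert(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < width && y < height)                     // guard against partial blocks
            img[y * width + x] = 255 - img[y * width + x];
    }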
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAdd(float* A, float* B, float* C, int n)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}

Host Code
int main()
{
    // Launch ceil(N/256) blocks of 256 threads each
    // (integer division truncates, so round up explicitly)
    vecAdd<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);
}
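A fuller host-side sketch, assuming h_A, h_B, h_C hold n input/output floats (these names are illustrative, not from the slides):

    void vecAddOnDevice(float *h_A, float *h_B, float *h_C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        // Allocate device memory and copy the inputs over
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements
        vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);

        // Copy the result back and release device memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }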
Kernel execution in a nutshell

    __global__
    void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    __host__
    void example(int n, float a, float *x, float *y)
    {
        int B = 128;               // threads per block
        int P = (n + B - 1) / B;   // ceil(n/B) blocks, rounded up in integer math
        saxpy<<<P, B>>>(n, a, x, y);
    }
[Figure: a kernel launch creates blocks Blk 0 … Blk p-1, which the GPU schedules onto its memory partitions M0 … Mk, each backed by RAM.]
CUDA Memory Model Overview
• Global memory
– Main means of communicating R/W data between host and threads
– Long latency access

[Figure: a grid of blocks — Block (0,0), Block (1,0) — each containing Thread (0,0), Thread (1,0), …; all threads and the host read and write the same global memory.]

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
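A small allocation sketch in the spirit of the slides (Md and the 64 × 64 size are illustrative; the size echoes the transfer example below):

    float *Md;
    int Width = 64;
    int size = Width * Width * sizeof(float);

    cudaMalloc((void**)&Md, size);   // allocate Width x Width floats in device global memory
    ...                              // launch kernels that use Md
    cudaFree(Md);                    // release the allocation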
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Takes a destination pointer, a source pointer, a byte count, and a transfer type:
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• Asynchronous transfer is available via cudaMemcpyAsync
CUDA Host-Device Data Transfer (cont.)

• Code example:
– Transfer a 64 * 64 single-precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
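The two transfers the example describes, sketched with size = 64 * 64 * sizeof(float):

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host M -> device Md
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device Md -> host M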
Square Matrix Multiplication

• P = M * N, all of size WIDTH x WIDTH
• Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH x WIDTH.]
Memory Layout of a Matrix in C

C stores a matrix row by row (row-major): the first row M0,0 M1,0 M2,0 M3,0 is followed in linear memory by M0,1 M1,1 M2,1 M3,1, then M0,2 … M3,2, then M0,3 … M3,3, so element Mx,y lives at offset y * Width + x.
    // A simple host version: one thread of control, triple loop
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }
Step 2: Input Matrix Data Transfer (Host-side Code)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;   // note: each pointer needs its own *
        ...
        // 1. Allocate and load M, N to device memory
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
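The rest of the function, sketched in the same style (allocate the output, invoke the kernel, copy back, free):

        // Allocate P on the device
        cudaMalloc((void**)&Pd, size);

        // 2. Kernel invocation code (see Step 5)

        // 3. Read P from the device and free device matrices
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }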
Step 4: Kernel Function

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Each thread computes one element of Pd,
        // the dot product of one Md row and one Nd column
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[threadIdx.y * Width + k] * Nd[k * Width + threadIdx.x];

        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd to produce its Pd element.]
Step 5: Kernel Invocation (Host-side Code)
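A sketch of the invocation, assuming a single block of Width x Width threads (consistent with the analysis below):

    // Setup the execution configuration: one block of Width x Width threads
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);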
One block of threads computes matrix Pd:
• Each thread computes one element of Pd
• Each thread
– Loads a row of matrix Md
– Loads a column of matrix Nd
– Performs one multiply and one addition for each pair of Md and Nd elements
– Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

[Figure: Thread (2,2) of the single block combining a row of Md with a column of Nd to produce its Pd element.]
Step 7: Handling Arbitrary Sized Square Matrices

• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
– Each has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
• You still need a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!

[Figure: Md, Nd, and Pd of size WIDTH x WIDTH; Pd is divided into TILE_WIDTH x TILE_WIDTH tiles, with block (bx, by) and thread (tx, ty) indices selecting one element per thread.]
A Small Example

• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
– Each has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

[Figure: the result matrix divided into four tiles, computed by Block(0,0), Block(1,0), Block(0,1), and Block(1,1).]
A Small Example: Multiplication

[Figure: one tile column of Nd — Nd0,0 Nd1,0 / Nd0,1 Nd1,1 / Nd0,2 Nd1,2 / Nd0,3 Nd1,3 — combined with rows of Md.]

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Global row and column of the Pd element this thread computes
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

        Pd[Row * Width + Col] = Pvalue;
    }
Revised Step 5: Kernel Invocation (Host-side Code)
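A sketch of the revised configuration, assuming Width is a multiple of TILE_WIDTH (consistent with Step 7):

    // Setup the execution configuration
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);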
Compilation
• Any source file containing CUDA language extensions must be compiled with NVCC
• NVCC is a compiler driver
– Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
• NVCC outputs:
– C code (host CPU code)
• Must then be compiled with the rest of the application using another tool
– PTX
• Object code directly
• Or, PTX source, interpreted at runtime
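For example (the file name is hypothetical; -o and -ptx are standard nvcc options):

    nvcc vecadd.cu -o vecadd    # compile host + device code into one executable
    nvcc -ptx vecadd.cu         # emit PTX source (vecadd.ptx) instead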