CS/EE 217
GPU Architecture and Programming
Lecture 2:
Introduction to CUDA C
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013
CUDA/OpenCL – Execution Model
• Integrated host + device application C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code (see the sketch below)
Serial Code (host)
        ↓
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
        ↓
Serial Code (host)
        ↓
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
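As a concrete sketch of this structure (hypothetical; the kernel name, sizes, and the doubling work are placeholders, not from the slide), an integrated host + device program might look like:

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= 2.0f;                      // highly parallel part
}

int main() {
    int n = 1024, nTid = 256, nBlk = (n + nTid - 1) / nTid;
    float *data_d;
    cudaMalloc((void **)&data_d, n * sizeof(float)); // serial host code
    myKernel<<<nBlk, nTid>>>(data_d, n);             // parallel kernel (device)
    cudaDeviceSynchronize();                         // back to serial host code
    cudaFree(data_d);
    return 0;
}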
From Natural Language to Electrons
Natural Language (e.g., English)
        ↓
Algorithm
        ↓
High-Level Language (C/C++, …)
        ↓
Compiler
        ↓
Instruction Set Architecture
        ↓
Microarchitecture
        ↓
Circuits
        ↓
Electrons
© Yale Patt and Sanjay Patel, Introduction to Computing Systems: From Bits and Gates to C and Beyond
The ISA
• An Instruction Set Architecture (ISA) is a contract between the hardware and the software.
• As the name suggests, it is a set of instructions that the architecture (hardware) can execute.
A program at the ISA level
• A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
• Program instructions operate on data stored in memory or provided by Input/Output (I/O) devices.
The von Neumann Model
[Diagram: Memory and I/O connected to a Processing Unit (ALU and Register File), directed by a Control Unit (PC and IR)]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (SPMD)
– Each thread has an index that it uses to compute memory addresses and make control decisions
[Figure: a 1D grid of threads 0, 1, 2, …, 254, 255, each executing:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
Thread Blocks: Scalable Cooperation
• Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
– Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1; each block contains threads 0 … 255, and every thread executes:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
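To make "cooperate via shared memory and barrier synchronization" concrete, here is a minimal hedged sketch (not from the original slides; the kernel name and the tile-reversal task are illustrative assumptions). Each block stages its elements in shared memory, waits at a barrier, then reads an element written by a different thread of the same block:

// Illustrative only: each block reverses its own 256-element tile.
// Assumes blockDim.x == 256 and n is a multiple of 256 for simplicity.
__global__ void reverseTileKernel(float *data, int n) {
    __shared__ float tile[256];            // visible to all threads in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = data[i];                     // stage this thread's element
    __syncthreads();                       // barrier: the whole tile is loaded
    data[i] = tile[blockDim.x - 1 - t];    // read another thread's element
}
// Hypothetical launch: reverseTileKernel<<<n / 256, 256>>>(data_d, n);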
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (3D grids since CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing (see the sketch below)
– Solving PDEs on volumes
– …

[Figure 3.2: An example of CUDA thread organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1, a 2×2 array of blocks, and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is expanded to show its 4×2×2 array of threads]
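For the image-processing case, a hedged sketch of 2D indexing (the kernel name, row-major layout, and the brightening operation are assumptions for illustration):

// Illustrative only: one thread per pixel of a row-major image.
__global__ void brightenKernel(float *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread index gives
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // a 2D pixel coordinate
    if (row < height && col < width)
        img[row * width + col] += 0.1f;                // placeholder per-pixel work
}
// Hypothetical launch with 16x16-thread blocks covering the image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brightenKernel<<<grid, block>>>(img_d, width, height);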
Vector Addition – Conceptual View
vector A:  A[0]  A[1]  A[2]  A[3]  A[4]  …  A[N-1]
            +     +     +     +     +   …     +
vector B:  B[0]  B[1]  B[2]  B[3]  B[4]  …  B[N-1]
            ↓     ↓     ↓     ↓     ↓   …     ↓
vector C:  C[0]  C[1]  C[2]  C[3]  C[4]  …  C[N-1]
Vector Addition – Traditional C Code
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
for (int i = 0; i < n; i++)
C[i] = A[i] + B[i];
}
int main()
{
// Memory allocation for A_h, B_h, and C_h
// I/O to read A_h and B_h, N elements
…
vecAdd(A_h, B_h, C_h, N);
}
Heterogeneous Computing vecAdd – Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
   int size = n * sizeof(float);
   float *A_d, *B_d, *C_d;
   …
1. // Allocate device memory for A, B, and C
   // Copy A and B to device memory
2. // Kernel launch code – have the device
   // perform the actual vector addition
3. // Copy C from the device memory
   // Free device vectors
}

[Diagram: Part 1 copies the inputs from host memory (CPU) to device memory (GPU), Part 2 runs on the GPU, and Part 3 copies the result back to host memory]
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W per-grid global memory
• Host code can:
– Transfer data to/from per-grid global memory

[Diagram: a (Device) Grid with Block (0,0) and Block (1,0); each thread has its own registers; all blocks share Global Memory, which the Host can access]

We will cover more later.
CUDA Device Memory Management API Functions

• cudaMalloc()
– Allocates an object in the device global memory
– Two parameters:
• Address of a pointer to the allocated object
• Size of the allocated object in bytes
• cudaFree()
– Frees an object from the device global memory
– One parameter: pointer to the freed object
Host-Device Data Transfer API Functions

• cudaMemcpy()
– Memory data transfer
– Requires four parameters:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type/direction of transfer
– Transfer to the device is asynchronous: cudaMemcpy() may return before the device-side copy completes, and cudaMemcpyAsync() makes the asynchrony explicit (see the sketch below)
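A hedged sketch of an explicitly asynchronous transfer (buffer names and size are illustrative; pinned host memory from cudaMallocHost() is what lets the copy truly overlap with host work):

// Illustrative only: asynchronous host-to-device copy on a stream.
float *h_A, *d_A;
int size = 1024 * sizeof(float);
cudaMallocHost((void **)&h_A, size);   // pinned (page-locked) host memory
cudaMalloc((void **)&d_A, size);
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
// ... the host is free to do other work here ...
cudaStreamSynchronize(stream);         // block until the copy completes
cudaStreamDestroy(stream);
cudaFreeHost(h_A); cudaFree(d_A);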
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

1.  // Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

2.  // Kernel invocation code – to be shown later
    …

3.  // Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
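Because checking every call inline is verbose, a common convenience (not part of these slides; the macro name is an assumption) is to wrap the pattern above in a macro:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro wrapping the error check above.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            printf("%s in %s at line %d\n",                        \
                   cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc((void **)&d_A, size));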
Example: Vector Addition Kernel

// Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

// Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    // (256.0 forces floating-point division; n/256 alone would truncate)
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
More on Kernel Launch

// Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each

    // Two equivalent ways to round the grid size up:
    dim3 DimGrid(n/256, 1, 1);
    if (n % 256) DimGrid.x++;
    // …or, in a single expression:
    dim3 DimGrid((n - 1)/256 + 1, 1, 1);

    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
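A minimal hedged sketch of blocking on a kernel launch:

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // returns immediately
cudaDeviceSynchronize();   // block the host until all device work has finished
// A subsequent cudaMemcpy() in the default stream would also wait implicitly.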
Kernel Execution in a Nutshell

// Host Code
__host__
void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

// Device Code
__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Diagram: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM]
More on CUDA Function Declarations
                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host
• __global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
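A hedged sketch of combining the qualifiers (the square() helper is an illustrative assumption): a __host__ __device__ function is compiled once for the CPU and once for the GPU, so both sides can call it:

// Compiled for both host and device; callable from either side.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);   // device-side call
}
// Host code can also call square() like an ordinary C function.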
Compiling A CUDA Program
Integrated C programs with CUDA extensions
                 ↓
           NVCC Compiler
            ↓         ↓
     Host Code     Device Code (PTX)
        ↓                 ↓
Host C Compiler/     Device Just-in-Time
     Linker               Compiler
        ↓                 ↓
Heterogeneous Computing Platform with CPUs, GPUs
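In practice (the file name is an assumption), a single-file program such as the vecAdd example could be built with:

nvcc vecAdd.cu -o vecAdd

nvcc separates host and device code, hands the host part to the host C compiler/linker, and compiles kernels to PTX, which the driver just-in-time compiles for the actual GPU.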
QUESTIONS?