
Summer School: e-Science with Many-core CPU/GPU Processors

Lecture 2
Introduction to CUDA

© David Kirk/NVIDIA and Wen-mei W. Hwu
Braga, Portugal, June 14-18, 2010
Overview
• CUDA programming model – basic concepts and data types
• CUDA application programming interface – simple examples to illustrate basic concepts and functionalities
• Performance features will be covered later

Many Language/API Choices
• C/C++
• OpenCL
• DirectX Compute
• Fortran
• Java
• Python
• .Net
• ATI’s Compute “Solution” (GPU HW, Driver, ISA…)
CUDA - C with no shader limitations
• Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

  Serial Code (host)
  Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args); ...
  Serial Code (host)
  Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args); ...
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel
processing device
• Data-parallel portions of an application are expressed as
device kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

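As the bullets above note, the compute device is a coprocessor with its own memory that needs thousands of lightweight threads to reach full efficiency. A minimal sketch of inspecting the device from the host with the CUDA runtime API (the printed fields are chosen for illustration, not taken from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);               // number of CUDA-capable devices

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);  // fill in device properties
            printf("Device %d: %s\n", dev, prop.name);
            printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
            printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
            printf("  global memory (MB):    %lu\n",
                   (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
        }
        return 0;
    }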
A GPU – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor

[Figure: graphics-mode pipeline – the Host feeds an Input Assembler and Setup/Rstr/ZCull; vertex, geometry and pixel thread issue units dispatch work onto arrays of streaming processors (SP) with texture units (TF), per-cluster L1 caches, shared L2 caches and frame-buffer (FB) partitions, coordinated by a thread processor.]
CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

[Figure: CUDA-mode view of the same hardware – the Host feeds an Input Assembler and a Thread Execution Manager; the streaming-processor arrays appear as parallel data caches with texture units and load/store paths to a shared Global Memory.]
CUDA C - extensions
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {

    __shared__ float region[M];
    ...

    region[threadIdx] = image[i];

    __syncthreads();
    ...

    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions

threads 0 1 2 3 4 5 6 7

    float a = input[threadIdx];
    float b = func(a);
    output[threadIdx] = b;
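A minimal sketch of the snippet above as a complete kernel. The names input, output and func come from the slide; the squaring helper is an invented stand-in for func:

    __device__ float func(float a)        // stand-in for the per-element operation
    {
        return a * a;
    }

    __global__ void elementwise(float *input, float *output)
    {
        float a = input[threadIdx.x];     // each thread reads its own element
        float b = func(a);
        output[threadIdx.x] = b;          // and writes its own result
    }

    // launched with one block of 8 threads, matching the figure:
    // elementwise<<<1, 8>>>(d_input, d_output);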
Thread Blocks: Scalable Cooperation
• Divide monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks cannot cooperate

Thread Block 0               Thread Block 1                     Thread Block N - 1
threads 0 1 2 3 4 5 6 7      threads 0 1 2 3 4 5 6 7      …     threads 0 1 2 3 4 5 6 7

Each block executes the same code:
    float a = input[threadIdx];
    float b = func(a);
    output[threadIdx] = b;
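A sketch of intra-block cooperation (not from the slides): each thread stages one element in shared memory, the block synchronizes at a barrier, and each thread then reads an element staged by a different thread – something threads in different blocks cannot do.

    #define BLOCK_SIZE 8                  // assumed block size for this sketch

    // assumes the array length is a multiple of BLOCK_SIZE
    __global__ void reverse_within_block(float *input, float *output)
    {
        __shared__ float buf[BLOCK_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = input[i];              // stage my element in shared memory

        __syncthreads();                          // barrier: the whole block has loaded

        // read another thread's staged element – only possible within a block
        output[i] = buf[blockDim.x - 1 - threadIdx.x];
    }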
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
  – blockIdx: 1D or 2D
  – threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

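A sketch of the 2D indexing pattern this enables for image processing (the image layout and the gain kernel are illustrative, not from the slides):

    __global__ void brighten(float *image, int width, int height, float gain)
    {
        // 2D block and thread indices map directly onto pixel coordinates
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if (row < height && col < width)
            image[row * width + col] *= gain;   // row-major pixel access
    }

    // launched, for example, with 16x16 blocks covering the image:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // brighten<<<grid, block>>>(d_image, width, height, 1.2f);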
Example: Vector Addition Kernel
                                                       Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    // Run ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);
}
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
                                                       Host Code
int main()
{
    // Run ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);
}
Kernel execution in a nutshell

__host__
void example(int n, float a, float *x, float *y)
{
    int B = 128,
        P = ceil(n / (float)B);
    saxpy<<<P, B>>>(n, a, x, y);
}

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) y[i] = a * x[i] + y[i];
}

[Figure: the launch creates a kernel of thread blocks Blk 0 … Blk p-1, which are scheduled onto the GPU's multiprocessors M0 … Mk, each with its own RAM.]
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access
• We will focus on global memory for now

[Figure: a grid of thread blocks, each with its own shared memory and per-thread registers; all blocks and the host access a common global memory.]

CUDA API Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance
CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates an object in device global memory
  – Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object (as shown in the code example on the next slide)
• cudaFree()
  – Frees an object from device global memory, given the pointer to it

[Figure: the memory-model diagram – grid, thread blocks with shared memory and registers, host and global memory.]

CUDA Device Memory Allocation (cont.)
• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to Md
  – “d” is often used to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous transfer

[Figure: the memory-model diagram, with transfers between host memory and device global memory.]
CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 * 64 single precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Example: Host code for vecAdd
int main()
{
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …, *h_C = …;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Execute the kernel on ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);

    // copy the result back to the host
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
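CUDA runtime calls report failure through return codes. A hedged sketch of the kind of checking one might wrap around the calls above – the CUDA_CHECK macro is an invented helper, not part of the slides:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // invented convenience macro: abort with a message if a runtime call fails
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                        __FILE__, __LINE__, cudaGetErrorString(err));   \
                exit(1);                                                \
            }                                                           \
        } while (0)

    // usage:
    // CUDA_CHECK(cudaMalloc((void**)&d_A, N * sizeof(float)));
    // CUDA_CHECK(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));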
CUDA Keywords

CUDA Function Declarations

                                      Executed on the:    Only callable from the:
__device__ float DeviceFunc()         device              device
__global__ void  KernelFunc()         device              host
__host__   float HostFunc()           host                host

• __global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together
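A small sketch of the last point: a function qualified with both __device__ and __host__ is compiled for both sides, so the same helper can be called from CPU code and from a kernel. The clampf helper is illustrative, not from the slides:

    __host__ __device__ float clampf(float x, float lo, float hi)
    {
        return x < lo ? lo : (x > hi ? hi : x);   // usable on host and device
    }

    __global__ void clampKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
    }

    // host-side call of the very same function:
    // float y = clampf(x, 0.0f, 1.0f);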
CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);            // 5000 thread blocks
dim3 DimBlock(4, 8, 8);           // 256 threads per block
size_t SharedMemBytes = 64;       // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking

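A sketch of what explicit blocking might look like after a launch, using the runtime API of that era (cudaThreadSynchronize; later CUDA versions use cudaDeviceSynchronize for the same purpose):

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

    // the launch returns immediately; wait here until the kernel has finished
    cudaThreadSynchronize();

    // note: a blocking call such as cudaMemcpy(..., cudaMemcpyDeviceToHost)
    // also waits for previously launched kernels to complete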
A Simple Running Example: Matrix Multiplication
• A simple matrix multiplication example that
illustrates the basic features of memory and
thread management in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread index usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

Programming Model: Square Matrix-Matrix Multiplication Example
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: matrices M, N and P, each WIDTH x WIDTH; one row of M and one column of N are read to produce one element of P.]
Memory Layout of a Matrix in C
M0,0  M1,0  M2,0  M3,0
M0,1  M1,1  M2,1  M3,1
M0,2  M1,2  M2,2  M3,2
M0,3  M1,3  M2,3  M3,3

M is laid out linearly (row-major) in memory as:
M0,0 M1,0 M2,0 M3,0   M0,1 M1,1 M2,1 M3,1   M0,2 M1,2 M2,2 M3,2   M0,3 M1,3 M2,3 M3,3

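In C such a Width x Width matrix is typically stored in a single linear array, so element (row, col) lives at offset row * Width + col – the indexing used in the kernels that follow. A minimal sketch:

    // access element (row, col) of a Width x Width matrix stored row-major
    float getElement(const float *M, int row, int col, int Width)
    {
        return M[row * Width + col];
    }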
Step 1: Matrix Multiplication – A Simple Host Version in C
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

[Figure: row i of M and column j of N are combined over index k to produce element (i, j) of P.]
Step 2: Input Matrix Data Transfer (Host-side Code)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer (Host-side Code)
    // 2. Kernel invocation code – to be shown later
    …

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

Step 4: Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

Step 4: Kernel Function (cont.)
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

[Figure: thread (tx, ty) walks along row ty of Md and column tx of Nd over index k, accumulating into element (tx, ty) of Pd.]
Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Need to Extend to Multiple Blocks
• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

[Figure: Grid 1 contains a single Block 1; thread (2, 2) reads one row of Md and one column of Nd of length WIDTH to produce one element of Pd.]
Step 7: Handling Arbitrary-Sized Square Matrices
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

You still need to put a loop around the kernel call for cases where
WIDTH/TILE_WIDTH is greater than the max grid size (64K)!

[Figure: Md, Nd and Pd of size WIDTH x WIDTH; block (bx, by) with thread (tx, ty) computes one TILE_WIDTH x TILE_WIDTH tile of Pd.]
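When WIDTH is not an exact multiple of TILE_WIDTH, the usual trick (a sketch, not from the slides) is to round the grid size up and guard out-of-range threads inside the kernel:

    // round up: enough blocks to cover Width even when it is not a multiple of TILE_WIDTH
    dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (Width + TILE_WIDTH - 1) / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // inside the kernel, threads outside the matrix simply do nothing:
    // if (Row < Width && Col < Width) { ... }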
A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

TILE_WIDTH = 2:

             Block(0,0)      Block(1,0)
             P0,0  P1,0      P2,0  P3,0
             P0,1  P1,1      P2,1  P3,1

             P0,2  P1,2      P2,2  P3,2
             P0,3  P1,3      P2,3  P3,3
             Block(0,1)      Block(1,1)
A Small Example: Multiplication

[Figure: element-level view of the small example – elements Md0,0 … Md3,1 of Md and Nd0,0 … Nd1,3 of Nd shown alongside the 4x4 result matrix Pd0,0 … Pd3,3.]
Revised Matrix Multiplication Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}
Revised Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Some Useful Information on Tools

Compiling a CUDA Program
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state

[Figure: compilation flow – a C/C++ CUDA application goes through NVCC, which emits CPU code plus PTX code; a PTX-to-target compiler then produces target code for the physical GPU.]

Example source:
    float4 me = gx[gtid];
    me.x += me.y * me.z;

Corresponding PTX:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;
Compilation
• Any source file containing CUDA language extensions must be compiled with NVCC
• NVCC is a compiler driver
  – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• NVCC outputs:
  – C code (host CPU code)
    • Must then be compiled with the rest of the application using another tool
  – PTX
    • Object code directly
    • Or, PTX source, interpreted at runtime

Linking
• Any executable with CUDA code requires two dynamic libraries:
  – The CUDA runtime library (cudart)
  – The CUDA core library (cuda)

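As a sketch of a typical build for the examples in this lecture (the file names are illustrative; nvcc links the runtime library automatically, and -lcuda is only needed when the driver API is used directly):

    nvcc vecAdd.cu -o vecAdd
    nvcc driverApiApp.c -o driverApiApp -lcuda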
