
Summer School: e-Science with Many-core CPU/GPU Processors

Lecture 2
Introduction to CUDA

© David Kirk/NVIDIA and Wen-mei W. Hwu
Braga, Portugal, June 14-18, 2010
Overview
• CUDA programming model – basic concepts and data types
• CUDA application programming interface – simple examples to illustrate basic concepts and functionalities
• Performance features will be covered later

Many Language/API Choices
• C/C++
• OpenCL
• DirectX Compute
• Fortran
• Java
• Python
• .Net
• ATI’s Compute “Solution” (GPU HW, Driver, ISA…)
CUDA - C with no shader limitations
• Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

  Serial Code (host)
  Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args); ...
  Serial Code (host)
  Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args); ...
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel
processing device
• Data-parallel portions of an application are expressed as
device kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

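As the bullets above note, the compute device is a coprocessor with its own memory that needs thousands of lightweight threads to reach full efficiency. A minimal sketch of inspecting the device from the host with the CUDA runtime API (the printed fields are chosen for illustration, not taken from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);               // number of CUDA-capable devices

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);  // fill in device properties
            printf("Device %d: %s\n", dev, prop.name);
            printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
            printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
            printf("  global memory (MB):    %lu\n",
                   (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
        }
        return 0;
    }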
A GPU – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor

[Figure: graphics-mode pipeline – the Host feeds an Input Assembler and Setup/Rstr/ZCull; vertex, geometry and pixel thread issue units dispatch work onto arrays of streaming processors (SP) with texture units (TF), per-cluster L1 caches, shared L2 caches and frame-buffer (FB) partitions, coordinated by a thread processor.]
CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

[Figure: CUDA-mode view of the same hardware – the Host feeds an Input Assembler and a Thread Execution Manager; the streaming-processor arrays appear as parallel data caches with texture units and load/store paths to a shared Global Memory.]
CUDA C - extensions
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {

    __shared__ float region[M];
    ...

    region[threadIdx] = image[i];

    __syncthreads();
    ...

    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions

threads 0 1 2 3 4 5 6 7

    float a = input[threadIdx];
    float b = func(a);
    output[threadIdx] = b;
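A minimal sketch of the snippet above as a complete kernel. The names input, output and func come from the slide; the squaring helper is an invented stand-in for func:

    __device__ float func(float a)        // stand-in for the per-element operation
    {
        return a * a;
    }

    __global__ void elementwise(float *input, float *output)
    {
        float a = input[threadIdx.x];     // each thread reads its own element
        float b = func(a);
        output[threadIdx.x] = b;          // and writes its own result
    }

    // launched with one block of 8 threads, matching the figure:
    // elementwise<<<1, 8>>>(d_input, d_output);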
Thread Blocks: Scalable Cooperation
• Divide monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks cannot cooperate

Thread Block 0               Thread Block 1                     Thread Block N - 1
threads 0 1 2 3 4 5 6 7      threads 0 1 2 3 4 5 6 7      …     threads 0 1 2 3 4 5 6 7

Each block executes the same code:
    float a = input[threadIdx];
    float b = func(a);
    output[threadIdx] = b;
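A sketch of intra-block cooperation (not from the slides): each thread stages one element in shared memory, the block synchronizes at a barrier, and each thread then reads an element staged by a different thread – something threads in different blocks cannot do.

    #define BLOCK_SIZE 8                  // assumed block size for this sketch

    // assumes the array length is a multiple of BLOCK_SIZE
    __global__ void reverse_within_block(float *input, float *output)
    {
        __shared__ float buf[BLOCK_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = input[i];              // stage my element in shared memory

        __syncthreads();                          // barrier: the whole block has loaded

        // read another thread's staged element – only possible within a block
        output[i] = buf[blockDim.x - 1 - threadIdx.x];
    }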
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
  – blockIdx: 1D or 2D
  – threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

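A sketch of the 2D indexing pattern this enables for image processing (the image layout and the gain kernel are illustrative, not from the slides):

    __global__ void brighten(float *image, int width, int height, float gain)
    {
        // 2D block and thread indices map directly onto pixel coordinates
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if (row < height && col < width)
            image[row * width + col] *= gain;   // row-major pixel access
    }

    // launched, for example, with 16x16 blocks covering the image:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // brighten<<<grid, block>>>(d_image, width, height, 1.2f);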
Example: Vector Addition Kernel
                                                       Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    // Run ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);
}
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
                                                       Host Code
int main()
{
    // Run ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);
}
Kernel execution in a nutshell

__host__
void example(int n, float a, float *x, float *y)
{
    int B = 128,
        P = ceil(n / (float)B);
    saxpy<<<P, B>>>(n, a, x, y);
}

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) y[i] = a * x[i] + y[i];
}

[Figure: the launch creates a kernel of thread blocks Blk 0 … Blk p-1, which are scheduled onto the GPU's multiprocessors M0 … Mk, each with its own RAM.]
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access
• We will focus on global memory for now

[Figure: a grid of thread blocks, each with its own shared memory and per-thread registers; all blocks and the host access a common global memory.]

CUDA API Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance
CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates an object in device global memory
  – Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object (as shown in the code example on the next slide)
• cudaFree()
  – Frees an object from device global memory, given the pointer to it

[Figure: the memory-model diagram – grid, thread blocks with shared memory and registers, host and global memory.]

CUDA Device Memory Allocation (cont.)
• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to Md
  – “d” is often used to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous transfer

[Figure: the memory-model diagram, with transfers between host memory and device global memory.]
CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 * 64 single precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Example: Host code for vecAdd
int main()
{
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …, *h_C = …;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Execute the kernel on ceil(N/256) blocks of 256 threads each
    vecAdd<<<ceil(N/256.0), 256>>>(d_A, d_B, d_C, N);

    // copy the result back to the host
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
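CUDA runtime calls report failure through return codes. A hedged sketch of the kind of checking one might wrap around the calls above – the CUDA_CHECK macro is an invented helper, not part of the slides:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // invented convenience macro: abort with a message if a runtime call fails
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                        __FILE__, __LINE__, cudaGetErrorString(err));   \
                exit(1);                                                \
            }                                                           \
        } while (0)

    // usage:
    // CUDA_CHECK(cudaMalloc((void**)&d_A, N * sizeof(float)));
    // CUDA_CHECK(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));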
CUDA Keywords

CUDA Function Declarations

                                      Executed on the:    Only callable from the:
__device__ float DeviceFunc()         device              device
__global__ void  KernelFunc()         device              host
__host__   float HostFunc()           host                host

• __global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together
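A small sketch of the last point: a function qualified with both __device__ and __host__ is compiled for both sides, so the same helper can be called from CPU code and from a kernel. The clampf helper is illustrative, not from the slides:

    __host__ __device__ float clampf(float x, float lo, float hi)
    {
        return x < lo ? lo : (x > hi ? hi : x);   // usable on host and device
    }

    __global__ void clampKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
    }

    // host-side call of the very same function:
    // float y = clampf(x, 0.0f, 1.0f);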
CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);            // 5000 thread blocks
dim3 DimBlock(4, 8, 8);           // 256 threads per block
size_t SharedMemBytes = 64;       // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking

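A sketch of what explicit blocking might look like after a launch, using the runtime API of that era (cudaThreadSynchronize; later CUDA versions use cudaDeviceSynchronize for the same purpose):

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

    // the launch returns immediately; wait here until the kernel has finished
    cudaThreadSynchronize();

    // note: a blocking call such as cudaMemcpy(..., cudaMemcpyDeviceToHost)
    // also waits for previously launched kernels to complete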
A Simple Running Example: Matrix Multiplication
• A simple matrix multiplication example that
illustrates the basic features of memory and
thread management in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread index usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

Programming Model: Square Matrix-Matrix Multiplication Example
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: matrices M, N and P, each WIDTH x WIDTH; one row of M and one column of N are read to produce one element of P.]
Memory Layout of a Matrix in C
M0,0  M1,0  M2,0  M3,0
M0,1  M1,1  M2,1  M3,1
M0,2  M1,2  M2,2  M3,2
M0,3  M1,3  M2,3  M3,3

M is laid out linearly (row-major) in memory as:
M0,0 M1,0 M2,0 M3,0   M0,1 M1,1 M2,1 M3,1   M0,2 M1,2 M2,2 M3,2   M0,3 M1,3 M2,3 M3,3

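In C such a Width x Width matrix is typically stored in a single linear array, so element (row, col) lives at offset row * Width + col – the indexing used in the kernels that follow. A minimal sketch:

    // access element (row, col) of a Width x Width matrix stored row-major
    float getElement(const float *M, int row, int col, int Width)
    {
        return M[row * Width + col];
    }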
Step 1: Matrix Multiplication – A Simple Host Version in C
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

[Figure: row i of M and column j of N are combined over index k to produce element (i, j) of P.]
Step 2: Input Matrix Data Transfer (Host-side Code)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer (Host-side Code)
    // 2. Kernel invocation code – to be shown later
    …

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

Step 4: Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

Step 4: Kernel Function (cont.)
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

[Figure: thread (tx, ty) walks along row ty of Md and column tx of Nd over index k, accumulating into element (tx, ty) of Pd.]
Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Need to Extend to Multiple Blocks
• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

[Figure: Grid 1 contains a single Block 1; thread (2, 2) reads one row of Md and one column of Nd of length WIDTH to produce one element of Pd.]
Step 7: Handling Arbitrary-Sized Square Matrices
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

You still need to put a loop around the kernel call for cases where
WIDTH/TILE_WIDTH is greater than the max grid size (64K)!

[Figure: Md, Nd and Pd of size WIDTH x WIDTH; block (bx, by) with thread (tx, ty) computes one TILE_WIDTH x TILE_WIDTH tile of Pd.]
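When WIDTH is not an exact multiple of TILE_WIDTH, the usual trick (a sketch, not from the slides) is to round the grid size up and guard out-of-range threads inside the kernel:

    // round up: enough blocks to cover Width even when it is not a multiple of TILE_WIDTH
    dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (Width + TILE_WIDTH - 1) / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // inside the kernel, threads outside the matrix simply do nothing:
    // if (Row < Width && Col < Width) { ... }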
A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

TILE_WIDTH = 2:

             Block(0,0)      Block(1,0)
             P0,0  P1,0      P2,0  P3,0
             P0,1  P1,1      P2,1  P3,1

             P0,2  P1,2      P2,2  P3,2
             P0,3  P1,3      P2,3  P3,3
             Block(0,1)      Block(1,1)
A Small Example: Multiplication

[Figure: element-level view of the small example – elements Md0,0 … Md3,1 of Md and Nd0,0 … Nd1,3 of Nd shown alongside the 4x4 result matrix Pd0,0 … Pd3,3.]
Revised Matrix Multiplication Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}
Revised Step 5: Kernel Invocation (Host-side Code)
    // Setup the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Some Useful Information on Tools

Compiling a CUDA Program
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state

[Figure: compilation flow – a C/C++ CUDA application goes through NVCC, which emits CPU code plus PTX code; a PTX-to-target compiler then produces target code for the physical GPU.]

Example source:
    float4 me = gx[gtid];
    me.x += me.y * me.z;

Corresponding PTX:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;
Compilation
• Any source file containing CUDA language extensions must be compiled with NVCC
• NVCC is a compiler driver
  – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• NVCC outputs:
  – C code (host CPU code)
    • Must then be compiled with the rest of the application using another tool
  – PTX
    • Object code directly
    • Or, PTX source, interpreted at runtime

Linking
• Any executable with CUDA code requires two dynamic libraries:
  – The CUDA runtime library (cudart)
  – The CUDA core library (cuda)

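As a sketch of a typical build for the examples in this lecture (the file names are illustrative; nvcc links the runtime library automatically, and -lcuda is only needed when the driver API is used directly):

    nvcc vecAdd.cu -o vecAdd
    nvcc driverApiApp.c -o driverApiApp -lcuda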
