CS/EE 217
GPU Architecture and Programming
Lecture 2:
Introduction to CUDA C
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013
CUDA/OpenCL – Execution Model
• Integrated host + device application C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code (see the sketch below)
Serial Code (host)
        ↓
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
        ↓
Serial Code (host)
        ↓
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
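As a concrete sketch of this structure (hypothetical; the kernel name, sizes, and the doubling work are placeholders, not from the slide), an integrated host + device program might look like:

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= 2.0f;                      // highly parallel part
}

int main() {
    int n = 1024, nTid = 256, nBlk = (n + nTid - 1) / nTid;
    float *data_d;
    cudaMalloc((void **)&data_d, n * sizeof(float)); // serial host code
    myKernel<<<nBlk, nTid>>>(data_d, n);             // parallel kernel (device)
    cudaDeviceSynchronize();                         // back to serial host code
    cudaFree(data_d);
    return 0;
}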
From Natural Language to Electrons
Natural Language (e.g., English)
        ↓
Algorithm
        ↓
High-Level Language (C/C++, …)
        ↓
Compiler
        ↓
Instruction Set Architecture
        ↓
Microarchitecture
        ↓
Circuits
        ↓
Electrons
© Yale Patt and Sanjay Patel, Introduction to Computing Systems: From Bits and Gates to C and Beyond
The ISA
• An Instruction Set Architecture (ISA) is a contract between the hardware and the software.
• As the name suggests, it is a set of instructions that the architecture (hardware) can execute.
A program at the ISA level
• A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
• Program instructions operate on data stored in memory or provided by Input/Output (I/O) devices.
The von Neumann Model
[Diagram: Memory and I/O connected to a Processing Unit (ALU and Register File), directed by a Control Unit (PC and IR)]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (SPMD)
– Each thread has an index that it uses to compute memory addresses and make control decisions
[Figure: a 1D grid of threads 0, 1, 2, …, 254, 255, each executing:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
Thread Blocks: Scalable Cooperation
• Divide the thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
– Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1; each block contains threads 0 … 255, and every thread executes:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
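To make "cooperate via shared memory and barrier synchronization" concrete, here is a minimal hedged sketch (not from the original slides; the kernel name and the tile-reversal task are illustrative assumptions). Each block stages its elements in shared memory, waits at a barrier, then reads an element written by a different thread of the same block:

// Illustrative only: each block reverses its own 256-element tile.
// Assumes blockDim.x == 256 and n is a multiple of 256 for simplicity.
__global__ void reverseTileKernel(float *data, int n) {
    __shared__ float tile[256];            // visible to all threads in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = data[i];                     // stage this thread's element
    __syncthreads();                       // barrier: the whole tile is loaded
    data[i] = tile[blockDim.x - 1 - t];    // read another thread's element
}
// Hypothetical launch: reverseTileKernel<<<n / 256, 256>>>(data_d, n);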
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (3D grids since CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing (see the sketch below)
– Solving PDEs on volumes
– …

[Figure 3.2: An example of CUDA thread organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1, a 2×2 array of blocks, and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is expanded to show its 4×2×2 array of threads]
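For the image-processing case, a hedged sketch of 2D indexing (the kernel name, row-major layout, and the brightening operation are assumptions for illustration):

// Illustrative only: one thread per pixel of a row-major image.
__global__ void brightenKernel(float *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread index gives
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // a 2D pixel coordinate
    if (row < height && col < width)
        img[row * width + col] += 0.1f;                // placeholder per-pixel work
}
// Hypothetical launch with 16x16-thread blocks covering the image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brightenKernel<<<grid, block>>>(img_d, width, height);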
Vector Addition – Conceptual View
vector A:  A[0]  A[1]  A[2]  A[3]  A[4]  …  A[N-1]
            +     +     +     +     +   …     +
vector B:  B[0]  B[1]  B[2]  B[3]  B[4]  …  B[N-1]
            ↓     ↓     ↓     ↓     ↓   …     ↓
vector C:  C[0]  C[1]  C[2]  C[3]  C[4]  …  C[N-1]
Vector Addition – Traditional C Code
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
for (int i = 0; i < n; i++)
C[i] = A[i] + B[i];
}
int main()
{
// Memory allocation for A_h, B_h, and C_h
// I/O to read A_h and B_h, N elements
…
vecAdd(A_h, B_h, C_h, N);
}
Heterogeneous Computing vecAdd – Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
   int size = n * sizeof(float);
   float *A_d, *B_d, *C_d;
   …
1. // Allocate device memory for A, B, and C
   // Copy A and B to device memory
2. // Kernel launch code – have the device
   // perform the actual vector addition
3. // Copy C from the device memory
   // Free device vectors
}

[Diagram: Part 1 copies the inputs from host memory (CPU) to device memory (GPU), Part 2 runs on the GPU, and Part 3 copies the result back to host memory]
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W per-grid global memory
• Host code can:
– Transfer data to/from per-grid global memory

[Diagram: a (Device) Grid with Block (0,0) and Block (1,0); each thread has its own registers; all blocks share Global Memory, which the Host can access]

We will cover more later.
CUDA Device Memory Management API Functions

• cudaMalloc()
– Allocates an object in the device global memory
– Two parameters:
• Address of a pointer to the allocated object
• Size of the allocated object in bytes
• cudaFree()
– Frees an object from the device global memory
– One parameter: pointer to the freed object
Host-Device Data Transfer API Functions

• cudaMemcpy()
– Memory data transfer
– Requires four parameters:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type/direction of transfer
– Transfer to the device is asynchronous: cudaMemcpy() may return before the device-side copy completes, and cudaMemcpyAsync() makes the asynchrony explicit (see the sketch below)
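A hedged sketch of an explicitly asynchronous transfer (buffer names and size are illustrative; pinned host memory from cudaMallocHost() is what lets the copy truly overlap with host work):

// Illustrative only: asynchronous host-to-device copy on a stream.
float *h_A, *d_A;
int size = 1024 * sizeof(float);
cudaMallocHost((void **)&h_A, size);   // pinned (page-locked) host memory
cudaMalloc((void **)&d_A, size);
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
// ... the host is free to do other work here ...
cudaStreamSynchronize(stream);         // block until the copy completes
cudaStreamDestroy(stream);
cudaFreeHost(h_A); cudaFree(d_A);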
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

1.  // Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

2.  // Kernel invocation code – to be shown later
    …

3.  // Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
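Because checking every call inline is verbose, a common convenience (not part of these slides; the macro name is an assumption) is to wrap the pattern above in a macro:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro wrapping the error check above.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            printf("%s in %s at line %d\n",                        \
                   cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc((void **)&d_A, size));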
Example: Vector Addition Kernel

// Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

// Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    // (256.0 forces floating-point division; n/256 alone would truncate)
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
More on Kernel Launch

// Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each

    // Two equivalent ways to round the grid size up:
    dim3 DimGrid(n/256, 1, 1);
    if (n % 256) DimGrid.x++;
    // …or, in a single expression:
    dim3 DimGrid((n - 1)/256 + 1, 1, 1);

    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
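A minimal hedged sketch of blocking on a kernel launch:

vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // returns immediately
cudaDeviceSynchronize();   // block the host until all device work has finished
// A subsequent cudaMemcpy() in the default stream would also wait implicitly.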
Kernel Execution in a Nutshell

// Host Code
__host__
void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

// Device Code
__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Diagram: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM]
More on CUDA Function Declarations
                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host
• __global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
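A hedged sketch of combining the qualifiers (the square() helper is an illustrative assumption): a __host__ __device__ function is compiled once for the CPU and once for the GPU, so both sides can call it:

// Compiled for both host and device; callable from either side.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);   // device-side call
}
// Host code can also call square() like an ordinary C function.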
Compiling A CUDA Program
Integrated C programs with CUDA extensions
                 ↓
           NVCC Compiler
            ↓         ↓
     Host Code     Device Code (PTX)
        ↓                 ↓
Host C Compiler/     Device Just-in-Time
     Linker               Compiler
        ↓                 ↓
Heterogeneous Computing Platform with CPUs, GPUs
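In practice (the file name is an assumption), a single-file program such as the vecAdd example could be built with:

nvcc vecAdd.cu -o vecAdd

nvcc separates host and device code, hands the host part to the host C compiler/linker, and compiles kernels to PTX, which the driver just-in-time compiles for the actual GPU.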
QUESTIONS?