0% found this document useful (0 votes)
12 views28 pages

3-computation

The document provides a series of CUDA programming exercises and examples, including converting sequential C code to CUDA, handling memory between CPU and GPU, and executing kernels. It emphasizes the need for separate memory management for CPU and GPU, and outlines typical CUDA program flow. Additionally, it includes classwork and homework assignments to reinforce the concepts learned.

Uploaded by

webbstu1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views28 pages

3-computation

The document provides a series of CUDA programming exercises and examples, including converting sequential C code to CUDA, handling memory between CPU and GPU, and executing kernels. It emphasizes the need for separate memory management for CPU and GPU, and outlines typical CUDA program flow. Additionally, it includes classwork and homework assignments to reinforce the concepts learned.

Uploaded by

webbstu1
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

CUDA Programming

Recap

Write a CUDA code corresponding to the
following sequential C code.
Note that there is no loop here.
#include <cuda.h>
#include <stdio.h>
#define N 100
#define N 100 __global__ void fun() {
int main() { printf("%d\n", threadIdx.x *
int i; threadIdx.x);
for (i = 0; i < N; ++i) }
printf("%d\n", i * i); int main() {
return 0; fun<<<1, N>>>();
} cudaDeviceSynchronize();
return 0;
2
}
Classwork

Write a CUDA code corresponding to the
following sequential C code.
#include <stdio.h>
#define N 100
int main() {
int a[N], i;
for (i = 0; i < N; ++i)
a[i] = i * i;
return 0;
}

3
Classwork

Write a CUDA code corresponding to the
following sequential C code.
#include <stdio.h>
#include <stdio.h> #include <cuda.h>
#define N 100
#define N 100 __global__ void fun(int *a) {
int main() { a[threadIdx.x] = threadIdx.x * threadIdx.x;
}
int a[N], i; int main() {
int a[N], *da;
for (i = 0; i < N; ++i) int i;
a[i] = i * i; cudaMalloc(&da, N * sizeof(int));
fun<<<1, N>>>(da);
return 0; cudaMemcpy(a, da, N * sizeof(int),
} cudaMemcpyDeviceToHost);
for (i = 0; i < N; ++i)
printf("%d\n", a[i]);
return 0; 4
}
Classwork

Write a CUDA code corresponding to the
following sequential C code.
#include <stdio.h>
#include <stdio.h> #include <cuda.h>
#define N 100
#define N 100 __global__ void fun(int *a) {
int main() { a[threadIdx.x] = threadIdx.x * threadIdx.x;
}
int a[N], i; int main() {
int a[N], *da;
for (i = 0; i < N; ++i) int i;
a[i] = i * i; cudaMalloc(&da, N * sizeof(int));
fun<<<1, N>>>(da);
return 0; cudaMemcpy(a, da, N * sizeof(int),
} cudaMemcpyDeviceToHost);
for (i = 0; i < N; ++i)
printf("%d\n", a[i]);
Observation return 0; 5
No cudaDeviceSynchronize required. }
Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
const char *msg = "Hello World.\n";
__global__ void dkernel() {
// no-op
}
int main() {
printf(msg);
return 0;
}

6
GPU Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
const char *msg = "Hello World.\n";
__global__ void dkernel() {
printf(msg);
}
int main() {
dkernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}

7
GPU Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
const char *msg = "Hello World.\n";
__global__ void dkernel() {
printf(msg);
}
int main() {
dkernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}

Compile: nvcc hello.cu 8


error: identifier "msg" is undefined in device code
GPU Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
Takeaway
const char *msg = "Hello World.\n";
__global__ void dkernel() { CPU and GPU
printf(msg); memories are
separate
} (for discrete GPUs).
int main() {
dkernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}

Compile: nvcc hello.cu 9


error: identifier "msg" is undefined in device code
GPU Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
Takeaway
#define msg "Hello World.\n"
__global__ void dkernel() { CPU and GPU
printf(msg); memories are
separate
} (for discrete GPUs).
int main() {
dkernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}

10
GPU Hello World with a Global.
#include <stdio.h>
#include <cuda.h>
Takeaway
#define msg "Hello World.\n"
__global__ void dkernel() { CPU and GPU
printf(msg); memories are
separate
} (for discrete GPUs).
int main() {
dkernel<<<1, 32>>>();
#define msg "Hello World.\n"
cudaDeviceSynchronize();
is okay.
return 0;
} Compile: nvcc hello.cu
Run: ./a.out
Hello World.
11
Hello World.
...
Separate Memories
D R A M D R A M

PCI Express
Bus
CPU GPU

12
Separate Memories
D R A M D R A M

PCI Express
Bus
CPU GPU


CPU and its associated (discrete) GPUs have
separate physical memory (RAM).

A variable in CPU memory cannot be accessed
directly in a GPU kernel.

13
Separate Memories
D R A M D R A M

PCI Express
Bus
CPU GPU


CPU and its associated (discrete) GPUs have
separate physical memory (RAM).

A variable in CPU memory cannot be accessed
directly in a GPU kernel.

A programmer needs to maintain copies of variables.

It is the programmer's responsibility to keep them in sync. 14
Typical CUDA Program Flow

CPU
CPU GPU
GPU

Load data
into CPU 1
memory.

File
System

15
Typical CUDA Program Flow
Copy data from CPU
to GPU memory.

CPU
CPU GPU
GPU

Load data
into CPU 1
memory.

File
System

16
Typical CUDA Program Flow
Copy data from CPU
to GPU memory.
Execute
2 3 GPU
kernel.

CPU
CPU GPU
GPU

Load data
into CPU 1
memory.

File
System

17
Typical CUDA Program Flow
Copy data from CPU
to GPU memory.
Execute
2 3 GPU
kernel.

CPU
CPU GPU
GPU

Load data 4
into CPU 1
memory. Copy results from
GPU to CPU memory.
File
System

18
Typical CUDA Program Flow
Copy data from CPU
to GPU memory.
Execute
Use 5 2 3 GPU
results on kernel.
CPU.
CPU
CPU GPU
GPU

Load data 4
into CPU 1
memory. Copy results from
GPU to CPU memory.
File
System

19
Typical CUDA Program Flow
1 Load data into CPU memory.
- fread / rand
2 Copy data from CPU to GPU memory.
- cudaMemcpy(..., cudaMemcpyHostToDevice)
3 Call GPU kernel.
- mykernel<<<x, y>>>(...)
4 Copy results from GPU to CPU memory.
- cudaMemcpy(..., cudaMemcpyDeviceToHost)
5 Use results on CPU. 20
Typical CUDA Program Flow

2 Copy data from CPU to GPU memory.


- cudaMemcpy(..., cudaMemcpyHostToDevice)

This means we need two copies of the same


variable – one on CPU another on GPU.
e.g., int *cpuarr, *gpuarr;
Matrix cpumat, gpumat;
21
Graph cpug, gpug;
CPU-GPU Communication
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(char *arr, int arrlen) {
unsigned id = threadIdx.x;
if (id < arrlen) {
++arr[id];
}
}

int main() {
char cpuarr[] = "Gdkkn\x1fVnqkc-",
*gpuarr;

cudaMalloc(&gpuarr, sizeof(char) * (1 + strlen(cpuarr)));


cudaMemcpy(gpuarr, cpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyHostToDevice);
dkernel<<<1, 32>>>(gpuarr, strlen(cpuarr) + 1 );
cudaDeviceSynchronize(); // unnecessary, but okay.
cudaMemcpy(cpuarr, gpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyDeviceToHost);
printf(cpuarr);

return 0;
} 22
CPU-GPU Communication
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(char *arr, int arrlen) {
unsigned id = threadIdx.x;
if (id < arrlen) {
++arr[id];
}
}

int main() {
char cpuarr[] = "Gdkkn\x1fVnqkc-",
*gpuarr;

cudaMalloc(&gpuarr, sizeof(char) * (1 + strlen(cpuarr)));


cudaMemcpy(gpuarr, cpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyHostToDevice);
dkernel<<<1, 32>>>(gpuarr, strlen(cpuarr));
cudaDeviceSynchronize(); // unnecessary, but okay.
cudaMemcpy(cpuarr, gpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyDeviceToHost);
printf(cpuarr);

return 0;
} 23
Classwork
1. Write a CUDA program to initialize an array of
size 32 to all zeros in parallel.

24
Classwork
1. Write a CUDA program to initialize an array of
size 32 to all zeros in parallel.
2. Change the array size to 1024.

25
Classwork
1. Write a CUDA program to initialize an array of
size 32 to all zeros in parallel.
2. Change the array size to 1024.
3. Create another kernel that adds i to array[i].

26
Classwork
1. Write a CUDA program to initialize an array of
size 32 to all zeros in parallel.
2. Change the array size to 1024.
3. Create another kernel that adds i to array[i].
4. Change the array size to 8000.
5. Check if answer to problem 3 still works.

27
Homework (z = x² + y³)


Read a sequence of integers from a file.

Square each number.

Read another sequence of integers from
another file.

Cube each number.

Sum the two sequences element-wise, store in
the third sequence.

Print the computed sequence.

28

You might also like