MODULE 5: HIGH PERFORMANCE COMPUTING WITH CUDA
CUDA PROGRAMMING MODEL
The CUDA programming model is a powerful parallel computing
platform and application programming interface (API) model
developed by NVIDIA. It enables developers to leverage the massive
parallel processing power of NVIDIA GPUs for High-Performance
Computing (HPC) tasks. CUDA stands for Compute Unified Device
Architecture.
CUDA is an extension of C/C++ programming that uses the
Graphics Processing Unit (GPU). It is a parallel computing
platform and an API (Application Programming Interface)
model developed by NVIDIA. It allows computations to be
performed in parallel, providing substantial speed-ups.
Using CUDA, one can harness the power of an NVIDIA GPU to
perform general computing tasks, such as processing
matrices and other linear algebra operations, rather than
only graphics calculations.
Why do we need CUDA?
GPUs are designed to perform high-speed
parallel computations to display graphics such as
games.
It uses widely available hardware: more than 100
million CUDA-capable GPUs are already deployed.
It provides 30-100x speed-up over other
microprocessors for some applications.
GPUs contain many small Arithmetic Logic Units
(ALUs), in contrast to CPUs, which have a few
larger ones. This allows for many parallel
calculations, such as calculating the color of
each pixel on the screen.
Architecture of CUDA
The G80 architecture contains 16 Streaming
Multiprocessors (SMs).
Each Streaming Multiprocessor has 8 Streaming
Processors (SPs), i.e., a total of 128
Streaming Processors (SPs).
Now, each Streaming processor has a MAD unit
(Multiplication and Addition Unit) and an
additional MU (multiplication unit).
The GT200 has 30 Streaming Multiprocessors
(SMs), and each Streaming Multiprocessor (SM)
has 8 Streaming Processors (SPs), i.e., a total
of 240 Streaming Processors (SPs), and more
than 1 TFLOP of processing power.
Each Streaming Processor is hardware
multithreaded and can run thousands of threads
per application.
The G80 card has 16 Streaming Multiprocessors
(SMs) and each SM has 8 Streaming Processors
(SPs), i.e., a total of 128 SPs and it supports 768
threads per Streaming Multiprocessor (note: not
per SP).
Since each Streaming Multiprocessor has 8 SPs,
each SP supports a maximum of 768/8 = 96
threads. The total number of threads that can
run on 128 SPs is 128 * 96 = 12,288 threads.
Therefore these processors are
called massively parallel.
The G80 chips have a memory bandwidth of
86.4GB/s.
It also has an 8GB/s communication channel with
the CPU (4GB/s for uploading to the CPU RAM,
and 4GB/s for downloading from the CPU RAM).
How does CUDA work?
GPUs run one kernel (a group of tasks) at a time.
Each kernel consists of blocks, which are
independent groups of threads.
Each block contains threads, which are levels of
computation.
The threads in each block typically work together
to calculate a value.
Threads in the same block can share memory.
In CUDA, sending data from the CPU to the GPU
is often the most costly part of the
computation.
For each thread, local memory is the fastest,
followed by shared memory; global, constant, and
texture memory are the slowest.
Typical CUDA Program flow
1. Load data into CPU memory
2. Copy data from CPU to GPU memory - e.g.,
cudaMemcpy(..., cudaMemcpyHostToDevice)
3. Call GPU kernel using device variables - e.g.,
kernel<<<grid, block>>>(gpuVar)
4. Copy results from GPU to CPU memory - e.g.,
cudaMemcpy(..., cudaMemcpyDeviceToHost)
5. Use results on CPU
How is work distributed?
Each thread "knows" the x and y coordinates of
the block it is in, and the coordinates where it is
in the block.
These positions can be used to calculate a
unique thread ID for each thread.
The computational work done will depend on the
value of the thread ID.
For example, a thread ID may correspond to a
particular group of matrix elements.
CUDA Applications
CUDA applications must run parallel operations on a lot
of data, and be processing-intensive.
1. Computational finance
2. Climate, weather, and ocean modelling
3. Data science and analytics
4. Deep learning and machine learning
5. Defence and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management
Benefits of CUDA
There are several advantages that give CUDA an edge
over traditional general-purpose GPU (GPGPU)
computing with graphics APIs:
Unified memory (CUDA 6.0 or later) and
unified virtual addressing (CUDA 4.0 or later).
Shared memory provides a fast area of shared
memory for CUDA threads. It can be used as a
caching mechanism and provides more
bandwidth than texture lookup.
Scattered reads - code can read from arbitrary
addresses in memory.
Improved performance on downloads and reads,
both to and from the GPU.
CUDA has full support for bitwise and integer
operations.
Limitations of CUDA
CUDA source code is processed on the host machine
according to C++ syntax rules. Older versions of
CUDA used C syntax rules, so up-to-date CUDA
source code may or may not work with them as
required.
CUDA has one-way interoperability (the ability of
computer systems or software to exchange and
make use of information) with rendering APIs
such as OpenGL. OpenGL can access
CUDA-registered memory, but CUDA cannot
access OpenGL memory.
Later versions of CUDA do not provide
emulators or fallback support for older versions.
CUDA supports only NVIDIA hardware.
BASIC PRINCIPLES OF CUDA PROGRAMMING
1. ✅ Heterogeneous Computing
CUDA uses both the CPU (host) and GPU (device).
CPU runs the main program, and the GPU handles parallel
computation tasks (called kernels).
2. ✅ Kernels and Threads
A kernel is a function that runs on the GPU.
When a kernel is launched, it runs many threads in parallel.
__global__ void add(int *a, int *b, int *c) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}
3. ✅ Thread Hierarchy
Threads are grouped in a hierarchy:
Threads are organized into blocks
Blocks are organized into a grid
Each thread has built-in IDs:
threadIdx, blockIdx, blockDim, gridDim
This allows scalable parallelism.
4. ✅ Memory Hierarchy
CUDA has multiple memory types:
Memory Type Scope Speed Use
Registers Per-thread Fastest Private variables
Shared Per-block Fast Block-level data sharing
Global All threads Slow Large data
Constant All threads (read-only) Cached Constant values
👉 Efficient memory use = better performance
5. ✅ SIMT (Single Instruction, Multiple Threads)
GPU executes instructions in groups of 32 threads, called a
warp
All threads in a warp execute the same instruction, but on
different data
Like vectorization, but flexible
6. ✅ Host-Device Memory Management
You must manually allocate and copy memory between host and
device:
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
7. ✅ Synchronization and Communication
Threads in the same block can synchronize using
__syncthreads()
Threads across blocks cannot directly communicate
8. ✅ Parallel Execution
Each thread can execute independently
Ideal for data-parallel problems (e.g., vector addition, matrix
multiplication)
Example: CUDA Program Structure
// 1. Allocate memory
cudaMalloc(&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// 2. Launch kernel
myKernel<<<numBlocks, threadsPerBlock>>>(d_A);
// 3. Copy results back
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
CONCEPTS OF THREAD AND BLOCKS
In CUDA (Compute Unified Device Architecture), which is
NVIDIA's parallel computing platform and API, the concepts of
threads and blocks are fundamental for writing GPU-accelerated
programs. Here's a clear breakdown:
1. Threads
A thread is the smallest unit of execution in CUDA.
Each thread runs the same kernel (a GPU function), but operates
on different data.
Threads have unique IDs to distinguish them from each other,
accessible via threadIdx.
int i = threadIdx.x;
This line gives each thread its own index within a block.
2. Blocks (Thread Blocks)
A block is a group of threads that execute the same kernel and
can:
o Share data via shared memory
o Synchronize using __syncthreads()
Threads in a block are organized in 1D, 2D, or 3D:
dim3 blockDim(16, 16); // 2D block of 16x16 threads
Each thread in a block is identified by:
threadIdx.x, threadIdx.y, threadIdx.z
3. Grids
A grid is a collection of blocks.
You launch a kernel over a grid of blocks.
kernel<<<gridDim, blockDim>>>(...);
Blocks in a grid also have IDs:
blockIdx.x, blockIdx.y, blockIdx.z
Total thread index in the grid:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
Thread Hierarchy Recap
Level Identifier Scope
Thread threadIdx Inside a block
Block blockIdx Inside a grid
Block size blockDim Threads per block
Grid size gridDim Blocks per grid
Key Advantages
Massive Parallelism: Thousands of threads can run
concurrently.
Scalability: Same code scales with data and hardware.
Memory Sharing: Threads in a block can share fast shared
memory.
Example: Vector Addition
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
GPU AND CPU DATA EXCHANGE IN CUDA
In CUDA, CPU (Host) and GPU (Device) have separate memory
spaces, so data must be explicitly transferred between them. Here's
how data exchange works:
1. Memory Spaces
Host (CPU): Uses standard RAM.
Device (GPU): Uses its own global memory.
2. Data Transfer Steps
✅ Step 1: Allocate Memory on GPU
cudaMalloc((void**)&d_A, size);
✅ Step 2: Copy Data from CPU to GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
✅ Step 3: Run the GPU Kernel
kernel<<<gridDim, blockDim>>>(d_A, d_B);
✅ Step 4: Copy Result from GPU to CPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
✅ Step 5: Free GPU Memory
cudaFree(d_A);
Transfer Direction Types
Direction Constant
CPU → GPU cudaMemcpyHostToDevice
GPU → CPU cudaMemcpyDeviceToHost
GPU ↔ GPU cudaMemcpyDeviceToDevice
Example Summary
int *h_A, *h_C; // Host pointers (allocate with malloc before use)
int *d_A, *d_C; // Device pointers
size_t size = N * sizeof(int);
// 1. Allocate on GPU
cudaMalloc(&d_A, size);
cudaMalloc(&d_C, size);
// 2. Copy from Host to Device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// 3. Kernel Execution
myKernel<<<blocks, threads>>>(d_A, d_C);
// 4. Copy from Device to Host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// 5. Free GPU memory
cudaFree(d_A); cudaFree(d_C);