CSE 599 I
Accelerated Computing -
Programming GPUs
CUDA Dynamic Parallelism
Objective
● Introduce dynamic parallelism, a relatively recent CUDA technique in which
kernels launch kernels
● Learn about various rules and restrictions that apply to dynamic parallelism
● Study some prototypical applications of dynamic parallelism
What is Dynamic Parallelism
An extension to the CUDA programming model which allows a thread to launch
another grid of threads executing another kernel
First introduced with the Kepler architecture (2012)
Uses for Dynamic Parallelism
● Recursive algorithms
● Processing at different levels of detail for different parts of the input (e.g. an irregular grid structure)
● Algorithms in which new work is “uncovered” along the way
Work Discovery Without Dynamic Parallelism
__global__ void workDiscoveryKernel(const int * starts, const int * ends, float * data) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  // each thread loops serially over its own, possibly uneven, range of work
  for (int j = starts[i]; j < ends[i]; ++j) {
    process(data[j]);
  }
}
Work Discovery
[Figure: without dynamic parallelism, the CPU launches a single grid and each thread (0-3) loops serially over its own, possibly uneven, range of work]
Work Discovery With Dynamic Parallelism
__global__ void workDiscoveryChildKernel(float * data, const int N);  // forward declaration

__global__ void workDiscoveryKernel(const int * starts, const int * ends, float * data) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  const int N = ends[i] - starts[i];
  // launch one child thread per work item instead of looping serially
  workDiscoveryChildKernel<<<(N - 1) / 128 + 1, 128>>>(data + starts[i], N);
}

__global__ void workDiscoveryChildKernel(float * data, const int N) {
  int j = threadIdx.x + blockDim.x * blockIdx.x;
  if (j < N) {
    process(data[j]);
  }
}
Work Discovery
[Figure: side-by-side comparison; without dynamic parallelism each thread loops over its own work items, while with dynamic parallelism each thread launches a child grid whose threads process the items]
Work Discovery
[Figure: the same comparison, annotated to note that with dynamic parallelism the child grids can be executed in parallel]
Global Memory and Dynamic Parallelism
Parent and child grids have two points of guaranteed global memory
consistency:
1. When the child grid is launched by the parent: all memory operations performed by the parent thread before launching the child are visible to the child grid when it starts
2. When the child grid finishes: all memory operations by any thread in the child grid are visible to the parent thread once the parent thread has synchronized with the completed child grid
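A minimal sketch of these two guarantees (the kernels and the flag buffer are illustrative; device-side cudaDeviceSynchronize() is discussed later):

__global__ void childKernel(int * flag) {
  // guarantee 1: the parent's write below is visible here, so *flag == 1
  if (threadIdx.x == 0) *flag = 2;
}

__global__ void parentKernel(int * flag) {
  *flag = 1;                     // performed before the launch, so visible to the child
  childKernel<<<1, 32>>>(flag);
  cudaDeviceSynchronize();       // guarantee 2: after this, *flag == 2 is visible here
}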
Constant Memory and Dynamic Parallelism
Constant memory cannot be changed from within a child grid, nor by the parent before launching a child grid
Thus, all constant memory must be set on the host before launching the parent
kernel and remain constant for the duration of the entire kernel tree
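For example, a minimal host-side sketch (the symbol, array, and kernel names are illustrative):

__constant__ float coefficients[4];  // read-only across the entire kernel tree

// host code: set constant memory once, before launching the parent kernel
float hostCoefficients[4] = {1.f, 2.f, 3.f, 4.f};
cudaMemcpyToSymbol(coefficients, hostCoefficients, sizeof(hostCoefficients));
parentKernel<<<1, 128>>>();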
Local Memory and Dynamic Parallelism
Local memory is private to a thread, and dynamic parallelism is no exception
Child grids have no privileged access to the parent thread’s local data
Not OK:

__global__ void badParentKernel() {
  float data[10];                // local memory, private to this thread
  childKernel<<<...>>>(data);    // illegal: the child cannot access the parent's local memory
}

__global__ void badParentKernel() {
  float value;                   // local memory
  childKernel<<<...>>>(&value);  // illegal for the same reason
}

OK:

__global__ void goodParentKernel(float * data) {  // data points to global memory
  childKernel<<<...>>>(data);
}

__device__ float value;  // global (device) memory, visible to child grids

__global__ void goodParentKernel(float * data) {
  childKernel<<<...>>>(&value);
}
Shared Memory and Dynamic Parallelism
Shared memory is private to a block of threads, and dynamic parallelism is no
exception
Parent threads have no privileged access to a child block’s shared memory
Memory Allocation from within a Kernel
In addition to kernel launches, dynamic parallelism allows memory allocation
from within a kernel via cudaMalloc() and cudaFree()
A few differences apply when allocating memory from within a kernel:
● Cannot allocate zero-copy memory
● The allocation limit is the device malloc heap size, which may be smaller
than the total device memory size
○ You can get or set this limit using cudaDevice[Get/Set]Limit() with the parameter
cudaLimitMallocHeapSize
● Memory allocated with cudaMalloc() inside a kernel must be freed with
cudaFree() from inside a kernel, and a kernel cannot call cudaFree() with a
pointer that was allocated on the host
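A minimal sketch of the pattern (the kernel name and sizes are illustrative; the device runtime requires compiling with -rdc=true and linking against cudadevrt):

__global__ void scratchKernel(float * out) {
  // this allocation comes from the device malloc heap
  float * scratch = NULL;
  cudaMalloc(&scratch, 256 * sizeof(float));
  if (scratch != NULL) {
    scratch[0] = 42.f;
    out[0] = scratch[0];
    cudaFree(scratch);  // must be freed from device code
  }
}

// host code: enlarge the heap (here to 64 MB) before the first kernel launch
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);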
Kernels All the Way Down
A kernel launched from within a kernel can launch a kernel, which can also
launch a kernel, etc.
The total “nesting depth” allowed with dynamic parallelism is limited to 24
There are other limits that tend to come up before the maximum nesting depth
Dynamic Parallelism with Multiple GPUs
Kernels launched from within a kernel cannot be executed on another GPU
Pending Launch Pool
The pending launch pool is a buffer that keeps track of kernels that are currently
being executed or waiting to be executed
By default, the pending launch pool has room for 2048 kernels before spilling
into a virtualized pool, which is very slow
Like the device malloc heap size, this limit can be queried or set using
cudaDevice[Get/Set]Limit(), this time with parameter
cudaLimitDevRuntimePendingLaunchCount
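For example, a host-side sketch (the new size is an arbitrary illustration):

// raise the fixed-size pool from the default 2048 pending launches to 16384
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);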
Implicit Synchronization
A parent thread is implicitly synchronized with its children before terminating
[Figure: timeline in which the CPU launches the parent grid, which launches a child grid; an implicit sync delays the parent grid's completion until the child grid completes]
Explicit Synchronization
A parent thread can also explicitly synchronize with child grids using
cudaDeviceSynchronize()
This blocks the calling thread on all child grids created by all threads in the block
To block all threads, either call cudaDeviceSynchronize() from every thread, or have one thread call it and follow the call with __syncthreads()
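A minimal sketch of the one-thread pattern inside a parent kernel:

if (threadIdx.x == 0) {
  cudaDeviceSynchronize();  // one thread waits on all child grids launched by the block
}
__syncthreads();            // every other thread waits on that thread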
Synchronization Depth
A parent kernel that performs explicit synchronization on a child grid may be
swapped out while waiting for the child grid to finish
This requires storing the entire state of the kernel, i.e. registers, shared memory,
program counters, etc.
The deepest nesting level at which synchronization is performed is referred to
as the synchronization depth
Synchronization depth is limited by the size of the backing store, which can be
checked or set using cudaDevice[Get/Set]Limit() and the parameter
cudaLimitDevRuntimeSyncDepth
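The host-side pattern is the same as for the other limits, e.g.:

// allow explicit parent-child synchronization down to nesting depth 4
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);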
Streams and Dynamic Parallelism
● Kernels can launch new kernels in both the default and non-default streams; launches into different non-default streams may execute concurrently
● Child kernels launched in explicit streams must use streams that were
allocated from within the kernel that launched them
● The scope of a stream is a block; there can be no sharing of streams
between host and device, between blocks, or between parent and child
Streams and Dynamic Parallelism
● If no stream is specified, the default stream is used, serializing all kernels
launched in the same block (even by different threads)
● cudaStreamSynchronize() cannot be called by device code; cudaDeviceSynchronize() must be used to wait for all child grids launched by the block
● All device streams must be non-blocking. To force awareness of this on the
programmer, streams created by the device must use
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)
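Putting these rules together, a minimal parent-kernel sketch (childKernel and its launch configuration are illustrative):

__global__ void childKernel(float * data, int n);  // assumed to be defined elsewhere

__global__ void parentKernel(float * data, int n) {
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);  // device streams must be non-blocking
  childKernel<<<(n - 1) / 128 + 1, 128, 0, stream>>>(data, n);
  cudaStreamDestroy(stream);  // safe here: resources are released once the child completes
}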
Events and Dynamic Parallelism
Events also have some support in device code, but not the full functionality
Currently, only cudaStreamWaitEvent() is allowed to be called from a kernel (no
timing or event synchronization)
Events are scoped to the block (like streams)
Events consume device memory; there is no hard limit on their number, but creating too many risks reduced concurrency
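A minimal sketch of device-side event use, ordering two device streams (childA and childB are illustrative kernels):

cudaStream_t s1, s2;
cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

cudaEvent_t e;
cudaEventCreateWithFlags(&e, cudaEventDisableTiming);  // timing events are not supported on the device

childA<<<1, 128, 0, s1>>>();
cudaEventRecord(e, s1);
cudaStreamWaitEvent(s2, e, 0);  // s2 waits until s1 reaches the event
childB<<<1, 128, 0, s2>>>();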
Example: Drawing Bezier Curves
A Bezier curve is a smooth curve defined by a set of n control points, where n determines the degree of the curve (the degree is n - 1)
For n = 3, the curve is a quadratic Bezier curve defined by control points P0, P1,
and P2, and the following equation:
B(t) = (1 - t)² P0 + 2(1 - t)t P1 + t² P2,  t ∈ [0, 1]
Example: Drawing Bezier Curves
A Bezier curve is defined over a continuous domain
We’ll be looking at a kernel to compute a set of discrete points along a
user-defined Bezier curve
To make the curve look smooth, we’ll want to compute more points in
high-curvature regions
[Figure: a low-curvature example curve, which needs few points to look smooth, and a high-curvature example, which needs many]
Example: Drawing Bezier Curves
#define MAX_NUM_POINTS 128

struct BezierCurve {
  float2 controlPoints[3];
  float2 vertices[MAX_NUM_POINTS];
  int numVertices;
};

// curvature estimate: distance of the middle control point from the chord
// midpoint, normalized by the chord length (float2 arithmetic and length()
// assume the usual vector helper operators, e.g. from helper_math.h)
__device__ float computeCurvature(const BezierCurve * curve) {
  return length(curve->controlPoints[1] - 0.5f * (curve->controlPoints[0] + curve->controlPoints[2])) /
         length(curve->controlPoints[2] - curve->controlPoints[0]);
}

We'll be given curves with controlPoints set, and we want to compute vertices
Example: Drawing Bezier Curves
__global__ void computeBezierCurvesKernel(BezierCurve * curves, const int N) {
  if (blockIdx.x < N) {
    const float curvature = computeCurvature(&curves[blockIdx.x]);
    // compute number of points based on curvature, between 4 and MAX_NUM_POINTS
    const int nVertices = min(max((int)(curvature * 64.f), 4), MAX_NUM_POINTS);
    curves[blockIdx.x].numVertices = nVertices;
    // the threads of the block stride over the vertices of this block's curve
    for (int p = threadIdx.x; p < nVertices; p += blockDim.x) {
      const float t = p / (float)(nVertices - 1);
      const float oneMinusT = 1.f - t;
      float2 position = oneMinusT * oneMinusT * curves[blockIdx.x].controlPoints[0] +
                        2.f * t * oneMinusT * curves[blockIdx.x].controlPoints[1] +
                        t * t * curves[blockIdx.x].controlPoints[2];
      curves[blockIdx.x].vertices[p] = position;
    }
  }
}
Example: Drawing Bezier Curves
#define MAX_NUM_POINTS 128

struct BezierCurve {
  float2 controlPoints[3];
  float2 * vertices;
  int numVertices;
};
With dynamic parallelism, we won’t need to statically declare the size of the
vertices buffer
Example: Drawing Bezier Curves
__global__ void computeBezierCurvesParentKernel(BezierCurve * curves, const int N) {
  const int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < N) {
    const float curvature = computeCurvature(&curves[i]);
    // compute number of points based on curvature, between 4 and MAX_NUM_POINTS
    curves[i].numVertices = min(max((int)(curvature * 64.f), 4), MAX_NUM_POINTS);
    // allocate exactly as many vertices as this curve needs (device malloc heap)
    cudaMalloc(&curves[i].vertices, curves[i].numVertices * sizeof(float2));
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    computeBezierCurvesChildKernel<<<(curves[i].numVertices - 1) / 32 + 1, 32, 0, stream>>>(&curves[i]);
    cudaStreamDestroy(stream);
  }
}
Example: Drawing Bezier Curves
__global__ void computeBezierCurvesChildKernel(BezierCurve * curve) {
  const int p = threadIdx.x + blockDim.x * blockIdx.x;
  if (p < curve->numVertices) {
    const float t = p / (float)(curve->numVertices - 1);
    const float oneMinusT = 1.f - t;
    float2 position = oneMinusT * oneMinusT * curve->controlPoints[0] +
                      2.f * t * oneMinusT * curve->controlPoints[1] +
                      t * t * curve->controlPoints[2];
    curve->vertices[p] = position;
  }
}
Example: Drawing Bezier Curves
__global__ void cleanupKernel(BezierCurve * curves, const int N) {
  const int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < N) {
    cudaFree(curves[i].vertices);  // device-heap allocations must be freed on the device
  }
}
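For completeness, a sketch of the host-side driver (launch configurations are illustrative); since the vertices buffers live in the device malloc heap, they must be freed by a kernel rather than by host code:

computeBezierCurvesParentKernel<<<(N - 1) / 64 + 1, 64>>>(curves, N);
// ... use the computed vertices ...
cleanupKernel<<<(N - 1) / 64 + 1, 64>>>(curves, N);
cudaDeviceSynchronize();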
Recursive Example: Quadtrees
A quadtree is a tree specially designed for storing 2D points
Each node represents a square in the plane, and has exactly 4 children, each
representing a quadrant of the square
[Figure: points A-K scattered in the plane, with the enclosing square recursively subdivided into quadrants, next to the corresponding quadtree]
Recursive Example: Quadtrees
points: A B C D E F G H I J K
[Figure: the points A-K in the plane; initially they are stored in the array in arbitrary order]
Recursive Example: Quadtrees
[Figure: a reorder pass groups the points by quadrant, so that each quadrant's points are contiguous in the array]
Recursive Example: Quadtrees
points: A D E G B C F H I J K
[Figure: the finished quadtree with its nodes numbered 0-10; each node owns a contiguous range of the reordered points array]
Recursive Example: Quadtrees
__global__ void buildQuadtreeKernel(QuadtreeNode * nodes, float2 * pointsA, float2 * pointsB, Parameters params) {
  __shared__ int smem[8];
  QuadtreeNode & node = nodes[blockIdx.x];
  const int numPoints = node.numPoints;
  // recursive base case
  if (numPoints < params.pointThreshold || node.depth() > params.maxDepth) return;
  const BoundingBox & bbox = node.boundingBox;
  const float2 center = bbox.center();
  const int pointsStart = node.pointsStart;
  const int pointsEnd = node.pointsEnd;
  // compute number of points for each child and store result in shared memory
  countPointsInChildNodes(pointsA + pointsStart, pointsEnd - pointsStart, center, smem);
  // do a scan on the number of points for each child to compute offsets
  scanForOffsets(smem);
  // move the points
  reorderPoints(pointsA + pointsStart, pointsB + pointsStart, pointsEnd - pointsStart, center, smem);
  if (threadIdx.x == blockDim.x - 1) {
    cudaMalloc(&node.children, 4 * sizeof(QuadtreeNode));  // allocate memory for the four children
    prepareChildren(node, smem);  // set bounding boxes, etc. for the children
    // recurse with one block per child; pointsA and pointsB swap roles (ping-pong buffers)
    buildQuadtreeKernel<<<4, blockDim.x>>>(node.children, pointsB, pointsA, params);
  }
}
Recursive Algorithms in CUDA Before 2012
Technically, recursion has always been possible
However, it required awkwardly rewriting the recursion as an explicit loop; essentially, one had to implement a call stack within the kernel
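A minimal sketch of that older pattern, assuming a tree stored as firstChild/childCount arrays and a user-provided processNode():

#define STACK_CAPACITY 64

__device__ void processNode(int node);  // assumed to be defined elsewhere

__global__ void iterativeTraversalKernel(const int * firstChild, const int * childCount) {
  int stack[STACK_CAPACITY];  // the hand-rolled "call stack", in local memory
  int top = 0;
  stack[top++] = 0;  // push the root node
  while (top > 0) {
    const int node = stack[--top];  // pop instead of returning from a call
    processNode(node);
    for (int c = 0; c < childCount[node]; ++c) {
      stack[top++] = firstChild[node] + c;  // push instead of recursing
    }
  }
}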
Conclusion / Takeaways
● Dynamic parallelism is a powerful tool that lets kernels implement recursive algorithms and dynamically redistribute work for better load balancing
Sources
https://www.wikipedia.org/
Kirk, David B., and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-On Approach. 3rd ed., Morgan Kaufmann, 2016.