CSE 599 I Accelerated Computing - Programming GPUs Lecture 15
Accelerated Computing -
Programming GPUs
CUDA Dynamic Parallelism
Objective
To learn how to use CUDA dynamic parallelism, which is useful for:
● Recursive algorithms
● Processing at different levels of detail for different parts of the input (e.g. an
irregular grid structure)
● Algorithms in which new work is “uncovered” along the way
Work Discovery Without Dynamic Parallelism
__global__ void workDiscoveryKernel(const int * starts, const int * ends, float * data) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = starts[i]; j < ends[i]; ++j)   // each thread serially processes its discovered range
        process(data[j]);                       // (load-imbalanced: a thread with a large range stalls its warp)
}
Work Discovery
[Figure: the CPU launches a parent grid; threads 0-3 each discover a different amount of work and launch child grids to process it]
// sketch of the child kernel: each thread processes one discovered element
__global__ void workDiscoveryChildKernel(float * data, const int N) {
    const int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < N) {
        process(data[j]);
    }
}
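The parent side of this pattern is not shown on the extracted slide; a minimal sketch (the kernel name, block size, and one-segment-per-thread layout are assumptions) of a parent in which each thread hands its discovered segment to a child grid:

__global__ void workDiscoveryParentKernel(const int * starts, const int * ends, float * data) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes one segment per launched thread
    const int N = ends[i] - starts[i];                     // amount of work this thread discovered
    if (N > 0) {
        // hand the segment to a child grid instead of looping over it serially
        workDiscoveryChildKernel<<<(N - 1) / 256 + 1, 256>>>(data + starts[i], N);
    }
}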
Global Memory and Dynamic Parallelism
Parent and child grids have two points of guaranteed global memory
consistency:
1. When the child grid is launched by the parent; all memory operations
performed by the parent thread before launching the child are visible to the
child grid when it starts
2. When the child grid finishes; all memory operations by any thread in the
child grid are visible to the parent thread once the parent thread has
synchronized with the completed child grid
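As an illustration of these two guarantees (a sketch with made-up kernel and buffer names):

__global__ void consistencyChild(int * buffer) {
    buffer[1] = buffer[0] + 1;     // sees the parent's write to buffer[0] (guarantee 1)
}

__global__ void consistencyParent(int * buffer) {
    buffer[0] = 42;                // written before the launch, so visible to the child
    consistencyChild<<<1, 1>>>(buffer);
    cudaDeviceSynchronize();       // device-side sync with the child grid (see Explicit Synchronization below)
    buffer[2] = buffer[1];         // after synchronizing, the child's write is visible (guarantee 2)
}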
Constant Memory and Dynamic Parallelism
Constant memory cannot be modified from device code at all, not even by a parent grid
before it launches a child grid
Thus, all constant memory must be set on the host before launching the parent
kernel and remain constant for the duration of the entire kernel tree
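For example (a sketch; the symbol and function names are made up), the constant is written once on the host and then read, unchanged, by the parent grid and every child grid it launches:

__constant__ float scaleFactor;        // readable, but never writable, from parent and child grids alike

void setup(float value) {
    // host code, before the parent kernel is launched
    cudaMemcpyToSymbol(scaleFactor, &value, sizeof(float));
}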
Local Memory and Dynamic Parallelism
Child grids have no privileged access to the parent thread’s local data
Passing a pointer to the parent thread’s local (or shared) memory to a child kernel is
not OK; passing a pointer to global memory is OK
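For instance (a sketch with made-up names):

__global__ void childTask(float * data) { /* ... */ }

__global__ void localMemoryParent(float * globalData) {
    float localData[8];                  // lives in the parent thread's local memory
    childTask<<<1, 8>>>(localData);      // not OK: the child cannot dereference parent-local storage
    childTask<<<1, 8>>>(globalData);     // OK: global memory is visible to the child grid
}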
Nesting Depth
A kernel launched from within a kernel can itself launch a kernel, which can in turn
launch another kernel, and so on
In practice, other limits tend to come up before the maximum nesting depth is reached
Dynamic Parallelism with Multiple GPUs
Child grids always execute on the same device as their parent; the device runtime cannot
launch work on a different GPU
The Pending Launch Pool
The pending launch pool is a buffer that keeps track of kernels that are currently
being executed or waiting to be executed
By default, the pending launch pool has room for 2048 kernels before spilling
into a virtualized pool, which is very slow
Like the device malloc heap size, this limit can be queried or set using
cudaDevice[Get/Set]Limit(), this time with parameter
cudaLimitDevRuntimePendingLaunchCount
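For example, host code can check the default and raise the limit before launching a kernel tree that may have many outstanding child launches (the value here is just an illustration):

size_t pendingLimit = 0;
cudaDeviceGetLimit(&pendingLimit, cudaLimitDevRuntimePendingLaunchCount);   // 2048 by default
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);           // avoid spilling into the slow virtualized pool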
Implicit Synchronization
A parent grid is not considered finished until every child grid launched by its threads
has completed; if the parent never synchronizes explicitly, this synchronization happens
implicitly when the parent grid exits
[Figure: timeline of the CPU launching a parent grid, whose child grid must complete before the parent grid is considered complete]
Explicit Synchronization
A parent thread can also explicitly synchronize with child grids using
cudaDeviceSynchronize()
This blocks the calling thread on all child grids created by all threads in the block
Suspending the parent while it waits requires storing the entire state of the kernel, i.e.
registers, shared memory, program counters, etc., in a backing store
Synchronization depth is limited by the size of the backing store, which can be
checked or set using cudaDevice[Get/Set]Limit() and the parameter
cudaLimitDevRuntimeSyncDepth
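For example, if parent kernels will call cudaDeviceSynchronize() three levels deep, the host should size the backing store before the top-level launch (the depth here is illustrative; the default limit is 2 levels):

cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 3);   // reserve backing store for explicit syncs 3 levels deep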
Streams and Dynamic Parallelism
● Kernels can launch new kernels into both the default stream and non-default
streams; launching children into separate streams allows them to execute
concurrently (see the sketch after this list)
● Child kernels launched in explicit streams must use streams that were
allocated from within the kernel that launched them
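For example (a sketch; the kernels are made up), a parent thread that wants two independent children to be able to run concurrently creates a device-side stream for each:

__global__ void childA(float * data) { /* ... */ }
__global__ void childB(float * data) { /* ... */ }

__global__ void streamParent(float * data) {
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);   // streams must be created in this kernel
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    childA<<<32, 256, 0, s1>>>(data);    // the two children may now execute concurrently
    childB<<<32, 256, 0, s2>>>(data);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}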
Events also have some support in device code, but not the full functionality (only
inter-stream ordering via cudaStreamWaitEvent, not timing or querying)
There is no fixed limit on the number of events, but they consume device memory, and
creating too many risks reducing concurrency
Example: Drawing Bezier Curves
For n = 3, the curve is a quadratic Bezier curve defined by control points P0, P1,
and P2, and the following equation:

B(t) = (1 - t)^2 P0 + 2 (1 - t) t P1 + t^2 P2,   t in [0, 1]
To make the curve look smooth, we’ll want to compute more points in
high-curvature regions
struct BezierCurve {
float2 controlPoints[3];
float2 vertices[MAX_NUM_POINTS];
int numVertices;
};
// fragment of the monolithic (non-dynamic-parallelism) kernel: one thread block per
// curve, and thread p writes one of that curve's vertices
if (blockIdx.x < N) {
    curves[blockIdx.x].vertices[p] = position;   // position = the curve evaluated at vertex p
}
Example: Drawing Bezier Curves
#define MAX_NUM_POINTS 128
struct BezierCurve {
float2 controlPoints[3];
float2 * vertices;
int numVertices;
};
With dynamic parallelism, we won’t need to statically declare the size of the
vertices buffer
Example: Drawing Bezier Curves
__global__ void computeBezierCurvesParentKernel(BezierCurve * curves, const int N) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // one parent thread per curve
    if (i < N) {
        // numVertices is assumed to have already been chosen (e.g. based on curvature);
        // allocate exactly that many vertices with device-side cudaMalloc
        cudaMalloc((void**)&curves[i].vertices, curves[i].numVertices * sizeof(float2));
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        computeBezierCurveChildKernel<<<(curves[i].numVertices-1)/32+1,32,0,stream>>>(&curves[i]);
        cudaStreamDestroy(stream);
    }
}
Example: Drawing Bezier Curves
__global__ void computeBezierCurveChildKernel(BezierCurve * curve) {
    const int p = blockIdx.x * blockDim.x + threadIdx.x;   // one child thread per vertex
    if (p < curve->numVertices) {
        curve->vertices[p] = position;   // position = B(t) evaluated at t = p / (numVertices - 1)
    }
}
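The computation of position is not shown on the slide; a small device helper for the quadratic Bezier formula might look like this (the helper is an assumption, not part of the original code):

// hypothetical helper: evaluate B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2 for t in [0, 1]
__device__ float2 evaluateQuadraticBezier(const float2 * P, const float t) {
    const float s = 1.0f - t;
    float2 b;
    b.x = s * s * P[0].x + 2.0f * s * t * P[1].x + t * t * P[2].x;
    b.y = s * s * P[0].y + 2.0f * s * t * P[1].y + t * t * P[2].y;
    return b;
}

The child thread would then compute position = evaluateQuadraticBezier(curve->controlPoints, (float)p / (curve->numVertices - 1)).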
Example: Drawing Bezier Curves
Buffers allocated with device-side cudaMalloc cannot be freed from the host, so a final kernel frees them:
__global__ void cleanupKernel(BezierCurve * curves, const int N) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        cudaFree(curves[i].vertices);   // free the device-allocated vertex buffer
    }
}
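For completeness, the host side of this example might look roughly as follows (a sketch: the curve count, launch configuration, and initialization are made up, and the program must be compiled with relocatable device code, e.g. nvcc -rdc=true -lcudadevrt):

int main() {
    const int N = 256;                                 // number of curves (illustrative)
    BezierCurve * d_curves;
    cudaMalloc((void**)&d_curves, N * sizeof(BezierCurve));
    // ... fill in the control points (and any per-curve tessellation parameters) ...

    computeBezierCurvesParentKernel<<<(N - 1) / 128 + 1, 128>>>(d_curves, N);
    cudaDeviceSynchronize();                           // waits for the parent grid and all of its children

    // ... use the tessellated vertices ...

    cleanupKernel<<<(N - 1) / 128 + 1, 128>>>(d_curves, N);   // device-allocated buffers must be freed on the device
    cudaDeviceSynchronize();
    cudaFree(d_curves);
    return 0;
}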
Recursive Example: Quadtrees
Each node represents a square in the plane, and has exactly 4 children, each
representing a quadrant of the square
[Figure: the points A-K in the plane and the corresponding quadtree]
Recursive Example: Quadtrees
points: A B C D E F G H I J K
Recursive Example: Quadtrees
reorder
points: A D E G  B C F  H I  J K   (points in the same quadrant are now stored contiguously)
Recursive Example: Quadtrees
points: A D E G B C F H I J K
[Figure: the points in the plane, labeled with their indices 0-10 in the reordered array]
Recursive Example: Quadtrees
Each launch of the kernel below builds one node: the thread block counts how many of the
node's points fall in each of its four quadrants, reorders the points so that each child's
points are contiguous (ping-ponging between the pointsA and pointsB buffers), and then
launches itself recursively on the four children
__global__ void buildQuadtreeKernel(QuadtreeNode * nodes, float2 * pointsA, float2 * pointsB, Parameters params) {
    // one block per node; the setup of node, pointsStart, pointsEnd, center, and the shared
    // buffer smem, as well as the recursion termination test, is omitted on the slide
    // compute number of points for each child and store result in shared memory
    countPointsInChildNodes(pointsA + pointsStart, pointsEnd - pointsStart, center, smem);
    // ... reorder the node's points from pointsA into pointsB so that each child's points are contiguous ...
    if (threadIdx.x == blockDim.x - 1) {
        cudaMalloc((void**)&node.children, 4 * sizeof(QuadtreeNode)); // allocate memory for the four children
        prepareChildren(node, smem); // set bounding boxes, etc. for the children
        buildQuadtreeKernel<<<4, blockDim.x>>>(node.children, pointsB, pointsA, params); // recursive launch: one block per child
    }
}
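The QuadtreeNode type is not defined on the slides; a minimal sketch of the fields the kernel above relies on (all names here are assumptions) might be:

struct QuadtreeNode {
    float2 corner, size;            // bounding square of the node (center = corner + 0.5f * size)
    int pointsStart, pointsEnd;     // the node's range in the current points buffer
    QuadtreeNode * children;        // its four children, allocated with device-side cudaMalloc
};

The Parameters struct would similarly carry things like the maximum recursion depth and the point threshold below which a node stops subdividing.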
Recursive Algorithms in CUDA Before 2012
Before dynamic parallelism (introduced with CUDA 5.0 and compute capability 3.5 in 2012),
kernels could not launch kernels, so recursive algorithms had to be driven from the host,
with the CPU launching a new kernel for each level of the recursion
Kirk, David B., and Wen-mei W. Hwu. Programming Massively Parallel Processors: A
Hands-on Approach. Morgan Kaufmann, 2016.