Launching a Kernel | CUDA

Last Updated : 26 Feb, 2026

In CUDA, a kernel launch starts parallel execution of a kernel function on the Device (GPU) from the Host (CPU). The launch uses the execution configuration syntax <<< ... >>>, which specifies how many blocks make up the grid and how many threads run in each block.

Syntax

KernelName<<<blocksPerGrid, threadsPerBlock>>>(arguments);

  • KernelName: Name of the GPU kernel function to execute.
  • blocksPerGrid: The number of blocks in the grid, chosen so that the grid provides at least one thread per operation.
  • threadsPerBlock: The number of threads in each block. This value is chosen by the programmer and is typically a multiple of the 32-thread warp size (64, 128, 256, and so on), up to a maximum of 1024 on current GPUs.
  • arguments: The data pointers or values passed to the kernel.

Mathematical Relation for Grid Sizing

When you have N total operations to perform, the grid (the entire collection of threads) must provide at least N threads. Because threads are launched in fixed-size blocks, the number of blocks is calculated with ceiling division:

B = \left\lceil \frac{N}{T} \right\rceil = \left\lfloor \frac{N + T - 1}{T} \right\rfloor

  • B: Number of blocks required
  • N: Total number of elements
  • T: Threads per block

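This formula rounds up: if N is not exactly divisible by T, one extra block is added to handle the remainder elements, and if N is an exact multiple of T, no extra block is added. With integer division it is written (N + T - 1) / T. Because the last block can contain more threads than there are elements left, the kernel must check its index against N before touching any data.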
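
As a quick check of the formula, the arithmetic can be worked through for three sizes with T = 256. The values of N are chosen only for illustration: one leaves a remainder, one divides evenly, and one is a single element past a multiple of T.

```latex
\begin{aligned}
N = 1000:\quad B &= \left\lfloor \frac{1000 + 256 - 1}{256} \right\rfloor = \left\lfloor \frac{1255}{256} \right\rfloor = 4, & B \cdot T &= 1024 \ge 1000 \\
N = 1024:\quad B &= \left\lfloor \frac{1024 + 256 - 1}{256} \right\rfloor = \left\lfloor \frac{1279}{256} \right\rfloor = 4, & B \cdot T &= 1024 \ge 1024 \\
N = 1025:\quad B &= \left\lfloor \frac{1025 + 256 - 1}{256} \right\rfloor = \left\lfloor \frac{1280}{256} \right\rfloor = 5, & B \cdot T &= 1280 \ge 1025
\end{aligned}
```

In the first case 1024 - 1000 = 24 threads have no element to process. In the third case a single extra element costs a whole additional block.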

C++
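int N = 1000;               // Total data elements
int threadsPerBlock = 256;  // Threads per block (a multiple of the 32-thread warp size)

// Calculate blocks required (ceiling division): (1000 + 255) / 256 = 4
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

// Launch the kernel with 4 blocks (1024 threads in total, 24 more than N).
// myKernel and data are assumed to be defined elsewhere; myKernel should
// skip any thread whose global index is >= N.
myKernel<<<blocksPerGrid, threadsPerBlock>>>(data, N);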
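
Since the launch above creates more threads than there are elements, it may help to see what the surplus threads should do. The following is a minimal host-side sketch in plain C++ (no GPU required) that imitates the index calculation and bounds check a kernel such as myKernel would typically perform. The helper names blocksFor and simulateGuardedLaunch are hypothetical and exist only for this illustration; they are not part of CUDA.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: number of blocks so that blocks * threadsPerBlock >= n.
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side stand-in for a guarded kernel launch (hypothetical helper, not a CUDA API).
// Each (block, thread) pair computes the global index that a real kernel would
// compute as blockIdx.x * blockDim.x + threadIdx.x, and writes only when that
// index is in range. Returns the number of threads that passed the bounds check.
int simulateGuardedLaunch(std::vector<int>& data, int threadsPerBlock) {
    const int n = static_cast<int>(data.size());
    const int blocks = blocksFor(n, threadsPerBlock);
    int active = 0;
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            const int i = block * threadsPerBlock + thread;
            if (i < n) {  // the guard a real kernel needs
                data[i] = i;
                ++active;
            }
        }
    }
    return active;
}
```

Calling simulateGuardedLaunch on a vector of 1000 elements with 256 threads per block visits 1024 simulated threads but writes exactly 1000 entries; the remaining 24 fail the guard and do nothing.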

Multidimensional Thread Organization

CUDA provides the dim3 type to organize threads and blocks in one, two, or three dimensions; any dimension left unspecified defaults to 1. This is useful for tasks like image processing, matrix operations, and volumetric data, where the data itself is multidimensional.

C++
// Define a 2D block of 16x16 threads (256 threads total)
dim3 threadsPerBlock(16, 16); 

// Define a 2D grid of 32x32 blocks
dim3 numBlocks(32, 32); 

// Launch the kernel with the 2D configuration
imageKernel<<<numBlocks, threadsPerBlock>>>(imageData);

Explanation:

  • dim3 threadsPerBlock(16, 16); creates 16 × 16 threads per block (total 256 threads).
  • dim3 numBlocks(32, 32); creates 32 × 32 blocks in the grid (1024 blocks), so the grid spans 512 × 512 threads.
  • imageKernel<<<numBlocks, threadsPerBlock>>>(imageData); launches the kernel using this 2D parallel configuration.
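  • Each thread reads its position within its block from threadIdx.x and threadIdx.y, and its block's position within the grid from blockIdx.x and blockIdx.y. Its global coordinates are blockIdx.x * blockDim.x + threadIdx.x and blockIdx.y * blockDim.y + threadIdx.y.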
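
The 32 × 32 grid above fits a 512 × 512 image exactly. For other image sizes, the same ceiling division can be applied per axis, and threads that fall outside the image need a guard. The sketch below is plain host-side C++ that illustrates this under stated assumptions: Dim2 is a hypothetical two-field stand-in for dim3, and gridFor and pixelIndex are illustrative helpers, not CUDA functions. Block and thread positions are passed in as arguments, whereas a real kernel would read them from the built-in variables.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, limited to two dimensions.
struct Dim2 {
    int x;
    int y;
};

// Ceiling division per axis so the grid covers every pixel.
Dim2 gridFor(int width, int height, Dim2 blockSize) {
    assert(width > 0 && height > 0 && blockSize.x > 0 && blockSize.y > 0);
    return Dim2{(width + blockSize.x - 1) / blockSize.x,
                (height + blockSize.y - 1) / blockSize.y};
}

// Row-major pixel index for one thread, or -1 when the thread lies outside the image.
// blockPos and threadPos play the role of blockIdx and threadIdx in a real kernel.
int pixelIndex(Dim2 blockPos, Dim2 threadPos, Dim2 blockSize, int width, int height) {
    const int col = blockPos.x * blockSize.x + threadPos.x;
    const int row = blockPos.y * blockSize.y + threadPos.y;
    if (col >= width || row >= height) {
        return -1;  // guard: this thread has no pixel to process
    }
    return row * width + col;
}
```

With 16 × 16 blocks, gridFor(512, 512, ...) gives the 32 × 32 grid used above, while a 500 × 300 image needs 32 × 19 blocks, and the threads in the rightmost and bottom blocks that land past the image edge are rejected by the guard.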

Example: This example shows a kernel being launched with 4 blocks, where each block contains 8 threads.

C++
%%cuda
#include <stdio.h>

__global__ void checkIndex() {
    // Each thread identifies itself
    printf("Block ID: %d, Thread ID: %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    int blocks = 4;
    int threads_per_block = 8;

    // Launching the kernel
    printf("Launching %d blocks with %d threads each...\n", blocks, threads_per_block);
    
    checkIndex<<<blocks, threads_per_block>>>();

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    return 0;
}

Output

Launching 4 blocks with 8 threads each...
Block ID: 1, Thread ID: 0
Block ID: 1, Thread ID: 1
Block ID: 1, Thread ID: 2
Block ID: 1, Thread ID: 3
Block ID: 1, Thread ID: 4
Block ID: 1, Thread ID: 5
Block ID: 1, Thread ID: 6
Block ID: 1, Thread ID: 7
Block ID: 0, Thread ID: 0
Block ID: 0, Thread ID: 1
Block ID: 0, Thread ID: 2
Block ID: 0, Thread ID: 3
Block ID: 0, Thread ID: 4
Block ID: 0, Thread ID: 5
Block ID: 0, Thread ID: 6
Block ID: 0, Thread ID: 7
Block ID: 3, Thread ID: 0
Block ID: 3, Thread ID: 1
Block ID: 3, Thread ID: 2
Block ID: 3, Thread ID: 3
Block ID: 3, Thread ID: 4
Block ID: 3, Thread ID: 5
Block ID: 3, Thread ID: 6
Block ID: 3, Thread ID: 7
Block ID: 2, Thread ID: 0
Block ID: 2, Thread ID: 1
Block ID: 2, Thread ID: 2
Block ID: 2, Thread ID: 3
Block ID: 2, Thread ID: 4
Block ID: 2, Thread ID: 5
Block ID: 2, Thread ID: 6
Block ID: 2, Thread ID: 7

Explanation:

  • checkIndex<<<4, 8>>>(): This launches the kernel with 4 blocks, each containing 8 threads, resulting in a total of 32 threads (4 × 8).
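  • blockIdx.x and threadIdx.x: These built-in CUDA variables give each thread the index of its block within the grid and its own index within that block, so every thread can work out which part of the data it is responsible for.
  • Output order: Blocks are scheduled independently, so the order in which they print (1, 0, 3, 2 in this run) is not guaranteed and can differ between runs.
  • Asynchronous launch: After the <<<...>>> call, the CPU does not wait and continues with the next instruction immediately. cudaDeviceSynchronize() blocks the CPU until the GPU has finished; without it, the program could exit before the kernel's printf output is displayed.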
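
In the output above, the blocks print in the order 1, 0, 3, 2 rather than 0, 1, 2, 3. The following host-side sketch in plain C++ suggests why that ordering need not affect results when each thread writes only to the slot given by its own global ID. The function runBlocksInOrder is a hypothetical helper for this illustration, and it assumes blockOrder is a permutation of 0 to B - 1.

```cpp
#include <cassert>
#include <vector>

// Host-side simulation (plain C++, not a CUDA API): visit the blocks in the given
// order and let each simulated thread record its global ID in the slot it owns.
std::vector<int> runBlocksInOrder(const std::vector<int>& blockOrder, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    const int blocks = static_cast<int>(blockOrder.size());
    std::vector<int> out(blockOrder.size() * threadsPerBlock, -1);
    for (int block : blockOrder) {
        assert(block >= 0 && block < blocks);
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            // Same formula a kernel would use: blockIdx.x * blockDim.x + threadIdx.x
            const int globalId = block * threadsPerBlock + thread;
            out[globalId] = globalId;
        }
    }
    return out;
}
```

Running it with the block orders {0, 1, 2, 3} and {1, 0, 3, 2} and 8 threads per block fills the same 32 slots with the same values, because no two threads share a global ID.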