Evaluating and building Triton GPU kernels
Red Hat Office of the CTO
Who Are We?
Craig Magina - Principal Software Engineer
cmagina@redhat.com
Sanjeev Rampal - Senior Principal Software Engineer
srampal@redhat.com
Steven Royer - Principal Software Engineer
sroyer@redhat.com
Red Hat Office of the CTO - Emerging Technologies
Agenda
▸ What is Triton?
▸ Profiling Triton Kernels with NVIDIA Nsight Tools
▸ Kernel Design & Optimization
▸ Q&A
▸ Conclusion
What is Triton?
What is Triton?
Research Project
Harvard University
https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
Created by Philippe Tillet, H. T. Kung, and David Cox
Presented at Machine Learning and Programming Languages (MAPL 2019)
A language and compiler for neural networks
▸ Originally a C-based language using LLVM IR
▸ Tile-oriented rather than thread-oriented
What is Triton?
Modern Triton
https://github.com/triton-lang/triton
Python-based DSL
PyTorch 2.0 adopted Triton in 2023
Triton 3.3.0 released April 2025
Multi-vendor hardware support
Accelerates training and inference
Hundreds of contributors
What is Triton?
Community
▸ Upstream: https://github.com/triton-lang/triton
▸ Upstream CPU: https://github.com/triton-lang/triton-cpu
▸ GPU Mode: https://github.com/gpu-mode
▸ GPU Mode Discord: https://discord.com/invite/gpumode
What is Triton?
Why Use Triton?
Custom kernels
▸ Optimize for specific use cases - see DeepSeek
▸ Operation fusion - saves memory bandwidth
▸ Reduce overhead
Why Triton?
▸ Much easier and faster to write than vendor-native kernels
▸ Not vendor-locked
What is Triton?
Simple Example
Vector: αx + y
https://github.com/redhat-et/tritonblas/blob/main/tritonblas/level1/axpy.py
Annotated stages in the example kernel: autotuner, JIT decorator, set up, load data, calculations, store results
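The linked example appears in the deck as an annotated screenshot. As a rough stand-in, here is a minimal AXPY-style Triton kernel sketch; it is simplified (the tritonblas version also uses Triton's autotuner, omitted here), and names and the block size are illustrative.

import torch
import triton
import triton.language as tl

@triton.jit
def axpy_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    # Set up: each program instance handles one BLOCK_SIZE-wide tile of the vector.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load data: masked loads guard the final, partially filled tile.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Calculations: alpha * x + y.
    out = alpha * x + y
    # Store results.
    tl.store(out_ptr + offsets, out, mask=mask)

def axpy(alpha, x, y, block_size=1024):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), block_size),)
    axpy_kernel[grid](x, y, out, alpha, x.numel(), BLOCK_SIZE=block_size)
    return out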
Profiling Triton Kernels on NVIDIA GPUs
Profiling Triton Kernels on NVIDIA GPUs
Profiling Triton Kernels
Using NVIDIA Nsight Profiling Tools
▸ NVIDIA Nsight Tools
・ https://developer.nvidia.com/tools-overview
・ Streamer containers for accessing Nsight Systems and Nsight Compute from a web browser
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-nsys
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-ncu
▸ Triton Dev Containers
・ https://github.com/redhat-et/triton-dev-containers
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Systems
▸ Provides a full system view of your application across both the CPU and GPU
▸ Useful for finding kernels that could be targets for further profiling
▸ Provides a means of launching Nsight Compute on a specific kernel found in the report
https://developer.nvidia.com/nsight-systems
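As a hedged illustration (not from the slides): wrapping regions of interest in NVTX ranges makes them easy to locate on the Nsight Systems timeline, for example when the script is run under something like nsys profile -t cuda,nvtx -o report python app.py.

import torch

def profiled_step(model, batch):
    # Named NVTX ranges show up on the Nsight Systems timeline, making it
    # easier to spot which Triton/PyTorch kernels belong to which phase.
    with torch.cuda.nvtx.range("forward"):
        out = model(batch)
    with torch.cuda.nvtx.range("reduction"):
        return out.max()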
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Compute
▸ Provides detailed metrics for your application (kernel)
▸ Summary view
・ Can filter API calls
・ Select a baseline call for comparison against additional reports
▸ Can launch Nsight Systems from it, providing a single place to find kernels requiring deeper profiling
https://developer.nvidia.com/nsight-compute
Profiling Triton Kernels on NVIDIA GPUs
NVIDIA Nsight Compute Demo
https://red.ht/45eWpGw
Profiling Triton Kernels on NVIDIA GPUs
NVIDIA Nsight Jupyter Notebook Demo
https://red.ht/4dmRchX
GPU Kernel Design & Optimization
Kernel optimization techniques
▸ There are several techniques; in this talk we will focus on two core ones
・ Tiled algorithms
・ E.g. tiled MatMul
・ Kernel fusion
・ Custom manual Triton kernel fusion
・ PyTorch automated kernel fusion
(Figure numbers are from older GPUs and for illustration only)
Source: https://arxiv.org/abs/2205.14135
Some GPU Architecture concepts
(Slide shows GPU architecture diagrams; image sources linked in the original deck)
Tiled Matrix Multiplication
(Slide shows a tiled matrix multiplication diagram; image source linked in the original deck)
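To make the tiling idea concrete, here is an illustrative CPU-side NumPy sketch of a blocked matmul; each output tile is accumulated from input tiles along K, which is the same access pattern the Triton kernel later in the deck implements on the GPU. Block sizes here are arbitrary assumptions.

import numpy as np

def tiled_matmul(A, B, BM=64, BN=64, BK=64):
    # Illustrative blocked matmul: each (BM x BN) output tile is accumulated
    # from (BM x BK) and (BK x BN) input tiles, so the per-step working set
    # stays small (on a GPU it would live in shared memory / registers).
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, BM):
        for n in range(0, N, BN):
            acc = np.zeros((min(BM, M - m), min(BN, N - n)), dtype=A.dtype)
            for k in range(0, K, BK):
                acc += A[m:m + BM, k:k + BK] @ B[k:k + BK, n:n + BN]
            C[m:m + BM, n:n + BN] = acc
    return C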
Kernel Fusion - PyTorch
▸ Combine two or more higher-layer functions (e.g. PyTorch operations) into a single fused kernel
▸ Avoid writing interim results back to DRAM
▸ Option 1: Leverage PyTorch’s built-in intelligent optimization and fused code generation (mixes & matches Triton, cuBLAS, etc.)
▸ Example: We want to calculate
・ Max(ReLU(MatMul(A, B))) (the code below uses leaky ReLU)
・ A, B are large 2-D tensors
# PyTorch eager approach
import torch
import torch.nn.functional as F

def my_torch_matmul_relu(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    c_max = torch.max(c)
    return c_max

# torch.compile approach
@torch.compile
def my_torch_matmul_relu_compiled(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    c_max = torch.max(c)
    return c_max
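One rough way to compare the two paths (illustrative only; shapes, dtype, device, and iteration counts are assumptions, and the Nsight tools above give far better measurements):

import time
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for name, fn in [("eager", my_torch_matmul_relu),
                 ("compiled", my_torch_matmul_relu_compiled)]:
    fn(a, b)                      # warm-up (triggers compilation for the compiled path)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        fn(a, b)
    torch.cuda.synchronize()
    print(name, (time.perf_counter() - start) / 20, "s per call")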
Kernel Fusion - Triton
▸ Option 2: Manually fused, hand-crafted Triton kernel
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
    b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
    # We accumulate along the K dimension.
    accumulator = tl.dot(a, b, accumulator)
    # accumulator = tl.ops(a, b, accumulator)
    # Advance the ptrs to the next K block.
    a_ptrs += BLOCK_SIZE_K * stride_ak
    b_ptrs += BLOCK_SIZE_K * stride_bk
if ACTIVATION == "leaky_relu":
    accumulator = leaky_relu(accumulator)
tile_max = tl.max(accumulator)
Slide callouts: tiled matmul, ReLU, max (stage 1)
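The tile_max above is only a per-tile ("stage 1") maximum; a second stage has to combine the maxima from all programs. One possible host-side wrapper is sketched below under assumptions: the demo kernel's actual launch signature and buffer layout may differ, and fused_kernel, partials, and the block sizes are illustrative names only.

import torch
import triton

def fused_matmul_relu_max(a, b, BLOCK_SIZE_M=128, BLOCK_SIZE_N=128):
    M, _ = a.shape
    _, N = b.shape
    num_tiles = triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N)
    # Stage 1: each program writes its tile_max into one slot of `partials`.
    partials = torch.full((num_tiles,), float("-inf"),
                          device=a.device, dtype=torch.float32)
    # fused_kernel[(num_tiles,)](a, b, partials, M, N, ...)  # hypothetical launch
    # Stage 2: a single cheap reduction over the per-tile maxima.
    return partials.max()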
Demo
Demo Jupyter Notebook
Example 1: Max(ReLU(MatMul(A, B))) on H100
Note: These are preliminary performance numbers (for illustration purposes only)
Example 2: Max(ReLU(MatMul(A, B))) on A100 and 4070
(Slide shows performance charts for the A100 and the 4070)
Note: These are preliminary performance numbers (for illustration purposes only)
Example 3: Max(ReLU(MatMul(A, B))) on AMD Instinct MI300
Note: These are preliminary performance numbers (for illustration purposes only)
Notes, Takeaways
▸ Start simple; move to more complex approaches as needed
・ If using PyTorch, start with eager mode for functional debugging
・ Enable torch.compile mode and profile performance improvements
・ Understand and use the correct torch.compile configuration options (e.g. dynamic, max-autotune; see the sketch after this list)
・ Evaluate (and optionally build upon) Torch Inductor generated code
・ Use the GPU performance profiling tools (Nsight, Proton, PyTorch profiler) to identify bottlenecks and room for improvement
・ Develop custom hand-crafted Triton kernel(s) if there is room for further improvement
▸ Review the maturity/readiness of these options for various GPU families
・ Test for the specific GPU device and application target parameters (tensor sizes, etc.)
・ E.g. are specific GPU hardware features fully supported/enabled by the compiler? (e.g. Tensor Cores, TMAs)
▸ Continue to track developments in this space; new improvements and innovations are continually arriving at a fast pace
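A small sketch of the torch.compile options mentioned above (illustrative, not exhaustive; see the PyTorch documentation for current option names and semantics):

import torch

# Reusing the eager function from the earlier slide as an example target.
compiled = torch.compile(
    my_torch_matmul_relu,
    mode="max-autotune",   # let Inductor autotune kernel/matmul configurations
    dynamic=True,          # tolerate varying tensor shapes without recompiling
)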
Torch Inductor CodeGen example
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (4096, 4096), (4096, 1))
    assert_size_stride(arg1_1, (4096, 4096), (4096, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((4096, 4096), (4096, 1), torch.float16)
        extern_kernels.mm(arg1_1, arg0_1, out=buf0)
        del arg0_1
        del arg1_1
        buf1 = empty_strided_cuda((512, ), (1, ), torch.float32)
        stream0 = get_raw_stream(0)
        triton_red_fused_leaky_relu_max_0.run(buf0, buf1, 512, 32768, stream=stream0)
        del buf0
        buf2 = empty_strided_cuda((), (), torch.float16)
        stream0 = get_raw_stream(0)
        triton_per_fused_leaky_relu_max_1.run(buf1, buf2, 1, 512, stream=stream0)
        del buf1
    return (buf2, )
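One way (among several) to obtain listings like the one above is, to our understanding, to enable output-code logging before calling the compiled function, e.g. by exporting TORCH_LOGS="output_code" in the environment or, roughly, from Python:

import torch._logging

# Ask Dynamo/Inductor to log the generated (Triton + wrapper) code as it is compiled.
torch._logging.set_logs(output_code=True)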
Q&A
Conclusion
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Thank you
