Evaluating and building Triton GPU kernels
Red Hat Office of the CTO
Who Are We?
Craig Magina - Principal Software Engineer
cmagina@redhat.com
Sanjeev Rampal - Senior Principal Software Engineer
srampal@redhat.com
Steven Royer - Principal Software Engineer
sroyer@redhat.com
Red Hat Office of the CTO - Emerging Technologies
Agenda
▸ What is Triton?
▸ Profiling Triton Kernels with NVIDIA Nsight Tools
▸ Kernel Design & Optimization
▸ Q&A
▸ Conclusion
What is Triton?
What is Triton?
Research Project
Harvard University
https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
Created by Philippe Tillet, H. T. Kung, and David Cox
Presented at Machine Learning and Programming Languages (MAPL 2019)
A language and compiler for neural networks
▸ Originally a C-based language using LLVM IR
▸ Tile-oriented rather than thread-oriented
What is Triton?
Modern Triton
https://github.com/triton-lang/triton
Python-based DSL
PyTorch 2.0 adopted Triton in 2023
Triton 3.3.0 released April 2025
Multi-vendor hardware support
Accelerates training and inference
Hundreds of contributors
What is Triton?
Community
▸ Upstream: https://github.com/triton-lang/triton
▸ Upstream CPU: https://github.com/triton-lang/triton-cpu
▸ GPU Mode: https://github.com/gpu-mode
▸ GPU Mode Discord: https://discord.com/invite/gpumode
What is Triton?
Why Use Triton?
Custom kernels
▸ Optimize for specific use cases - see DeepSeek
▸ Operation fusion - saves memory bandwidth
▸ Reduce overhead
Why Triton?
▸ Much easier and faster to write than vendor-native kernels
▸ Not vendor-locked
What is Triton?
Simple Example
Vector: αx + y
https://github.com/redhat-et/tritonblas/blob/main/tritonblas/level1/axpy.py
Annotated stages in the example kernel: autotuner, JIT decorator, set up, load data, calculations, store results
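The linked example appears in the deck as an annotated screenshot. As a rough stand-in, here is a minimal AXPY-style Triton kernel sketch; it is simplified (the tritonblas version also uses Triton's autotuner, omitted here), and names and the block size are illustrative.

import torch
import triton
import triton.language as tl

@triton.jit
def axpy_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    # Set up: each program instance handles one BLOCK_SIZE-wide tile of the vector.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load data: masked loads guard the final, partially filled tile.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Calculations: alpha * x + y.
    out = alpha * x + y
    # Store results.
    tl.store(out_ptr + offsets, out, mask=mask)

def axpy(alpha, x, y, block_size=1024):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), block_size),)
    axpy_kernel[grid](x, y, out, alpha, x.numel(), BLOCK_SIZE=block_size)
    return out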
Profiling Triton Kernels on NVIDIA GPUs
Profiling Triton Kernels on NVIDIA GPUs
Profiling Triton Kernels
Using NVIDIA Nsight Profiling Tools
▸ NVIDIA Nsight Tools
・ https://developer.nvidia.com/tools-overview
・ Streamer containers for accessing Nsight Systems and Nsight Compute from a web browser
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-nsys
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-ncu
▸ Triton Dev Containers
・ https://github.com/redhat-et/triton-dev-containers
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Systems
▸ Provides a full system view of your application across both the CPU and GPU
▸ Useful for finding kernels that could be targets for further profiling
▸ Provides a means of launching Nsight Compute on a specific kernel found in the report
https://developer.nvidia.com/nsight-systems
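As a hedged illustration (not from the slides): wrapping regions of interest in NVTX ranges makes them easy to locate on the Nsight Systems timeline, for example when the script is run under something like nsys profile -t cuda,nvtx -o report python app.py.

import torch

def profiled_step(model, batch):
    # Named NVTX ranges show up on the Nsight Systems timeline, making it
    # easier to spot which Triton/PyTorch kernels belong to which phase.
    with torch.cuda.nvtx.range("forward"):
        out = model(batch)
    with torch.cuda.nvtx.range("reduction"):
        return out.max()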
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Compute
▸ Provides detailed metrics for your application (kernel)
▸ Summary view
・ Can filter API calls
・ Select a baseline call for comparison against additional reports
▸ Can launch Nsight Systems from it, providing a single place to find kernels requiring deeper profiling
https://developer.nvidia.com/nsight-compute
Profiling Triton Kernels on NVIDIA GPUs
NVIDIA Nsight Compute Demo
https://red.ht/45eWpGw
Profiling Triton Kernels on NVIDIA GPUs
NVIDIA Nsight Jupyter Notebook Demo
https://red.ht/4dmRchX
GPU Kernel Design & Optimization
Kernel optimization techniques
▸ There are several techniques; in this talk we will focus on two core ones
・ Tiled algorithms
・ E.g. tiled MatMul
・ Kernel fusion
・ Custom manual Triton kernel fusion
・ PyTorch automated kernel fusion
(Figure numbers are from older GPUs and for illustration only)
Source: https://arxiv.org/abs/2205.14135
Some GPU Architecture concepts
(Slide shows GPU architecture diagrams; image sources linked in the original deck)
Tiled Matrix Multiplication
(Slide shows a tiled matrix multiplication diagram; image source linked in the original deck)
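To make the tiling idea concrete, here is an illustrative CPU-side NumPy sketch of a blocked matmul; each output tile is accumulated from input tiles along K, which is the same access pattern the Triton kernel later in the deck implements on the GPU. Block sizes here are arbitrary assumptions.

import numpy as np

def tiled_matmul(A, B, BM=64, BN=64, BK=64):
    # Illustrative blocked matmul: each (BM x BN) output tile is accumulated
    # from (BM x BK) and (BK x BN) input tiles, so the per-step working set
    # stays small (on a GPU it would live in shared memory / registers).
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(0, M, BM):
        for n in range(0, N, BN):
            acc = np.zeros((min(BM, M - m), min(BN, N - n)), dtype=A.dtype)
            for k in range(0, K, BK):
                acc += A[m:m + BM, k:k + BK] @ B[k:k + BK, n:n + BN]
            C[m:m + BM, n:n + BN] = acc
    return C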
Kernel Fusion - PyTorch
▸ Combine two or more higher-layer functions (e.g. PyTorch operations) into a single fused kernel
▸ Avoid writing interim results back to DRAM
▸ Option 1: Leverage PyTorch’s built-in intelligent optimization and fused code generation (mixes & matches Triton, cuBLAS, etc.)
▸ Example: We want to calculate
・ Max(ReLU(MatMul(A, B))) (the code below uses leaky ReLU)
・ A, B are large 2-D tensors
# PyTorch eager approach
import torch
import torch.nn.functional as F

def my_torch_matmul_relu(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    c_max = torch.max(c)
    return c_max

# torch.compile approach
@torch.compile
def my_torch_matmul_relu_compiled(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    c_max = torch.max(c)
    return c_max
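One rough way to compare the two paths (illustrative only; shapes, dtype, device, and iteration counts are assumptions, and the Nsight tools above give far better measurements):

import time
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for name, fn in [("eager", my_torch_matmul_relu),
                 ("compiled", my_torch_matmul_relu_compiled)]:
    fn(a, b)                      # warm-up (triggers compilation for the compiled path)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        fn(a, b)
    torch.cuda.synchronize()
    print(name, (time.perf_counter() - start) / 20, "s per call")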
Kernel Fusion - Triton
▸ Option 2: Manually fused, hand-crafted Triton kernel
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
    b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
    # We accumulate along the K dimension.
    accumulator = tl.dot(a, b, accumulator)
    # accumulator = tl.ops(a, b, accumulator)
    # Advance the ptrs to the next K block.
    a_ptrs += BLOCK_SIZE_K * stride_ak
    b_ptrs += BLOCK_SIZE_K * stride_bk
if ACTIVATION == "leaky_relu":
    accumulator = leaky_relu(accumulator)
tile_max = tl.max(accumulator)
Slide callouts: tiled matmul, ReLU, max (stage 1)
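The tile_max above is only a per-tile ("stage 1") maximum; a second stage has to combine the maxima from all programs. One possible host-side wrapper is sketched below under assumptions: the demo kernel's actual launch signature and buffer layout may differ, and fused_kernel, partials, and the block sizes are illustrative names only.

import torch
import triton

def fused_matmul_relu_max(a, b, BLOCK_SIZE_M=128, BLOCK_SIZE_N=128):
    M, _ = a.shape
    _, N = b.shape
    num_tiles = triton.cdiv(M, BLOCK_SIZE_M) * triton.cdiv(N, BLOCK_SIZE_N)
    # Stage 1: each program writes its tile_max into one slot of `partials`.
    partials = torch.full((num_tiles,), float("-inf"),
                          device=a.device, dtype=torch.float32)
    # fused_kernel[(num_tiles,)](a, b, partials, M, N, ...)  # hypothetical launch
    # Stage 2: a single cheap reduction over the per-tile maxima.
    return partials.max()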
Demo
Demo Jupyter Notebook
Example 1: Max(ReLU(MatMul(A, B))) on H100
Note: These are preliminary performance numbers (for illustration purposes only)
Example 2: Max(ReLU(MatMul(A, B))) on A100 and 4070
(Slide shows performance charts for the A100 and the 4070)
Note: These are preliminary performance numbers (for illustration purposes only)
Example 3: Max(ReLU(MatMul(A, B))) on AMD Instinct MI300
Note: These are preliminary performance numbers (for illustration purposes only)
Notes, Takeaways
▸ Start simple; move to more complex approaches as needed
・ If using PyTorch, start with eager mode for functional debugging
・ Enable torch.compile mode and profile performance improvements
・ Understand and use the correct torch.compile configuration options (e.g. dynamic, max-autotune; see the sketch after this list)
・ Evaluate (and optionally build upon) Torch Inductor generated code
・ Use the GPU performance profiling tools (Nsight, Proton, PyTorch profiler) to identify bottlenecks and room for improvement
・ Develop custom hand-crafted Triton kernel(s) if there is room for further improvement
▸ Review the maturity/readiness of these options for various GPU families
・ Test for the specific GPU device and application target parameters (tensor sizes, etc.)
・ E.g. are specific GPU hardware features fully supported/enabled by the compiler? (e.g. Tensor Cores, TMAs)
▸ Continue to track developments in this space; new improvements and innovations are continually arriving at a fast pace
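A small sketch of the torch.compile options mentioned above (illustrative, not exhaustive; see the PyTorch documentation for current option names and semantics):

import torch

# Reusing the eager function from the earlier slide as an example target.
compiled = torch.compile(
    my_torch_matmul_relu,
    mode="max-autotune",   # let Inductor autotune kernel/matmul configurations
    dynamic=True,          # tolerate varying tensor shapes without recompiling
)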
Torch Inductor CodeGen example
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (4096, 4096), (4096, 1))
    assert_size_stride(arg1_1, (4096, 4096), (4096, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((4096, 4096), (4096, 1), torch.float16)
        extern_kernels.mm(arg1_1, arg0_1, out=buf0)
        del arg0_1
        del arg1_1
        buf1 = empty_strided_cuda((512, ), (1, ), torch.float32)
        stream0 = get_raw_stream(0)
        triton_red_fused_leaky_relu_max_0.run(buf0, buf1, 512, 32768, stream=stream0)
        del buf0
        buf2 = empty_strided_cuda((), (), torch.float16)
        stream0 = get_raw_stream(0)
        triton_per_fused_leaky_relu_max_1.run(buf1, buf2, 1, 512, stream=stream0)
        del buf1
    return (buf2, )
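One way (among several) to obtain listings like the one above is, to our understanding, to enable output-code logging before calling the compiled function, e.g. by exporting TORCH_LOGS="output_code" in the environment or, roughly, from Python:

import torch._logging

# Ask Dynamo/Inductor to log the generated (Triton + wrapper) code as it is compiled.
torch._logging.set_logs(output_code=True)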
Q&A
Conclusion
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Thank you
