What is Triton?
Research Project
Harvard University
https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
Created by Philippe Tillet, H. T. Kung, and David Cox
Presented at Machine Learning and Programming Languages (MAPL 2019)
Language and compiler for tiled neural network computations
▸ Originally a C-based language built on LLVM IR
▸ Tile-oriented rather than thread-oriented programming model
What is Triton?
Modern Triton
https://github.com/triton-lang/triton
Python-based DSL
PyTorch 2.0 adopted Triton (used by TorchInductor for GPU code generation) in 2023
Triton 3.3.0 released Apr 2025
Multi-vendor hardware support
Accelerate training and inference
Hundreds of contributors
What is Triton?
Community
▸ Upstream: https://github.com/triton-lang/triton
▸ Upstream CPU: https://github.com/triton-lang/triton-cpu
▸ GPU Mode: https://github.com/gpu-mode
▸ GPU Mode Discord: https://discord.com/invite/gpumode
What is Triton?
Why Use Triton?
Custom kernels
▸ Optimize for specific use cases - see DeepSeek
▸ Operation fusion - save memory bandwidth
▸ Reduce overhead
Why Triton?
▸ Much easier and faster to write than vendor native kernels
▸ Not vendor locked
What is Triton?
Simple Example
Vector AXPY: αx + y
https://github.com/redhat-et/tritonblas/blob/main/tritonblas/level1/axpy.py
Kernel structure: autotuner, JIT, set up, load data, calculations, store results
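The kernel linked above adds an autotuner on top of this structure; below is a minimal sketch (not the tritonblas implementation; the block size and function names are illustrative) of a JIT-compiled AXPY kernel with the same set up / load data / calculations / store results stages:

import torch
import triton
import triton.language as tl

@triton.jit
def axpy_kernel(x_ptr, y_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    # Set up: each program instance covers one BLOCK_SIZE chunk of the vectors
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load data
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Calculations: alpha * x + y
    out = alpha * x + y
    # Store results
    tl.store(out_ptr + offsets, out, mask=mask)

def axpy(alpha, x, y):
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    axpy_kernel[grid](x, y, out, alpha, n_elements, BLOCK_SIZE=1024)
    return out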
Profiling Triton Kernels on NVIDIA GPUs
Profiling Triton Kernels
Using NVIDIA Nsight Profiling Tools
▸ NVIDIA Nsight Tools
・ https://developer.nvidia.com/tools-overview
・ Streamer containers give access to Nsight Systems and Nsight Compute from a web browser
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-nsys
・ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/devtools/containers/nsight-streamer-ncu
▸ Triton Dev Containers
・ https://github.com/redhat-et/triton-dev-containers
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Systems
▸ Provides a full system view of your application across both the CPU and GPU
▸ Useful for finding kernels that could be targets for further profiling
▸ Provides a means of running Compute on a specific kernel found in the report
https://developer.nvidia.com/nsight-systems
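A practical tip (not from the slides): wrapping regions of interest in NVTX ranges makes the Triton kernels launched inside them easy to spot on the Nsight Systems timeline. A minimal sketch using PyTorch's torch.cuda.nvtx API, with an illustrative helper and label name:

import torch

def run_with_nvtx(fn, *args, label="triton-region"):
    # The named range appears as a span on the Nsight Systems timeline
    torch.cuda.nvtx.range_push(label)
    out = fn(*args)
    torch.cuda.synchronize()  # make sure the launched kernels finish inside the range
    torch.cuda.nvtx.range_pop()
    return out

Profiling the script with Nsight Systems (for example, nsys profile python my_script.py, where my_script.py is a placeholder) then shows the labelled range around the kernels of interest.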
Profiling Triton Kernels on NVIDIA GPUs
Profiling with Nsight Compute
▸ Provides detailed metrics for your application (kernel)
▸ Summary
・ Can filter API calls
・ Select a baseline call for comparison against additional reports
▸ Can be launched from Systems, providing a single place to find kernels requiring deeper profiling
https://developer.nvidia.com/nsight-compute
Kernel optimization techniques
▸ There are several kernel optimization techniques .. in this talk we will focus on two core techniques
・ Tiled algorithms, e.g. tiled MatMul
・ Kernel fusion: custom manual Triton kernel fusion, or PyTorch automated kernel fusion
(numbers are from older GPUs & for illustration only)
Source: https://arxiv.org/abs/2205.14135
Kernel Fusion - PyTorch
▸ Combine 2 or more higher-layer functions (e.g. PyTorch operations) into a single fused kernel
▸ Avoid writing interim results back to DRAM
▸ Option 1: Leverage PyTorch's built-in intelligent optimization and fused code generation (mixes & matches Triton, cuBLAS, etc.)
▸ Example: We want to calculate
・ Max(ReLU(MatMul(A, B)))
・ A, B are large 2-D tensors
# PyTorch eager approach
import torch
import torch.nn.functional as F

def my_torch_matmul_relu(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)  # leaky ReLU used as the activation here
    c_max = torch.max(c)
    return c_max

# torch.compile approach: same code, decorated so TorchInductor can fuse it
@torch.compile
def my_torch_matmul_relu_compiled(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    c_max = torch.max(c)
    return c_max
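A simple way to compare the two variants is to time them on identical inputs; the sketch below is illustrative only (tensor sizes, dtype, and iteration count are assumptions, and the compiled variant's first call pays a one-time compilation cost):

import time
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def bench(fn, iters=50):
    fn(a, b)                      # warm-up (triggers compilation for the compiled variant)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(a, b)
    torch.cuda.synchronize()      # wait for all GPU work before stopping the clock
    return (time.perf_counter() - start) / iters

print("eager   :", bench(my_torch_matmul_relu))
print("compiled:", bench(my_torch_matmul_relu_compiled))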
Kernel Fusion - Triton
▸ Option 2: Manually fused, hand-crafted Triton kernel
# Tiled matmul: accumulate the output tile one BLOCK_SIZE_K slice at a time
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
    b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
    # We accumulate along the K dimension.
    accumulator = tl.dot(a, b, accumulator)
    # accumulator = tl.ops(a, b, accumulator)
    # Advance the ptrs to the next K block.
    a_ptrs += BLOCK_SIZE_K * stride_ak
    b_ptrs += BLOCK_SIZE_K * stride_bk
# ReLU: the activation is fused in while the tile is still on-chip
if ACTIVATION == "leaky_relu":
    accumulator = leaky_relu(accumulator)
# Max (stage 1): per-tile maximum
tile_max = tl.max(accumulator)
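tile_max above is only the per-tile result, so a second stage is still needed to combine it across program instances. One possible stage 2, sketched here with a hypothetical partial_max_ptr output buffer (one slot per program instance, not part of the original snippet):

# Stage 2 (sketch): each program instance writes its tile max into its own slot
pid = tl.program_id(axis=0)
tl.store(partial_max_ptr + pid, tile_max)

The host can then finish the reduction over that small buffer, e.g. torch.max(partial_max).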
Example 1: Max(ReLU(MatMul(A, B))) on H100
Note: These are preliminary performance numbers (for illustration purposes only)
Example 2: Max(ReLU(MatMul(A, B))) on A100 and 4070
Note: These are preliminary performance numbers (for illustration purposes only)
Example 3: Max(ReLU(MatMul(A, B))) on AMD Instinct MI300
Note: These are preliminary performance numbers (for illustration purposes only)
Notes, Takeaways
▸ Start simple .. move to more complex as needed
・ If using PyTorch, start with eager mode for functional debugging
・ Enable torch.compile mode and profile the performance improvements
・ Understand and use the correct torch.compile configuration options (e.g. dynamic, max-autotune) - see the sketch after this list
・ Evaluate (and optionally build upon) TorchInductor-generated code
・ Use the GPU performance profiling tools (Nsight, Proton, PyTorch profiler) to identify bottlenecks and room for improvement
・ Develop custom hand-crafted Triton kernel(s) if there is room for further improvement
▸ Review the maturity/readiness of these options for the various GPU family types
・ Test for the specific GPU device and application target parameters (tensor sizes etc.)
・ E.g. are specific GPU hardware features fully supported/enabled by the compiler? (e.g. Tensor Cores, TMAs)
▸ Continue to track developments in this space .. new improvements and innovations are continually arriving at a fast pace
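As a sketch of the torch.compile configuration options mentioned above (mode="max-autotune" and dynamic are documented torch.compile arguments; the function body simply reuses the earlier example):

import torch
import torch.nn.functional as F

# mode="max-autotune" lets TorchInductor spend more time autotuning the generated kernels;
# dynamic=True requests kernels that tolerate varying input shapes.
@torch.compile(mode="max-autotune", dynamic=True)
def matmul_relu_max(a, b):
    c = torch.matmul(a, b)
    c = F.leaky_relu(c, negative_slope=0.01)
    return torch.max(c)

To evaluate the Inductor-generated code, one option is to run with the environment variable TORCH_LOGS=output_code, which logs the generated Triton kernels.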