GPU Computing for Machine Learning Systems
Deep Learning Frameworks
Introduction
Jacob Kahn
This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks
Image made with generative AI
Why Build Frameworks for Deep Learning?
● Tensors – provide operations and handle memory
management
● Hardware acceleration – fast, optimized
implementations
● Hardware agnosticism – run the same program on
multiple platforms (GPU, CPU)
Operating Modes for Deep Learning Frameworks
● Training – adjusts model weights using gradients
from backpropagation
○ Compute-intensive
○ Distributed computation
● Inference – frozen model weights with data flowing
in a single forward pass
○ Static, which allows more optimization
Popular Frameworks
● PyTorch: Dynamic-first, flexible, researcher-friendly
● TensorFlow: Mixes dynamic and declarative, Keras
integration
● JAX: XLA-based, efficient autograd, optimized for
distribution
Factors That Distinguish Deep Learning Frameworks
● Computation Model
○ Defines how tensor programs are executed
○ Influences how models are expressed by implementers
○ Varies based on user goals (researchers, practitioners,
or downstream users)
Factors That Distinguish Deep Learning Frameworks
Frontend Language Evolution
● Python replaced fragmented frontends like Lua and C++
● Simplifies implementation across frameworks
● Optimizations for HPC: reduced overhead, better parallelism
Factors That Distinguish Deep Learning Frameworks
Performance
● GPU computation – dominates execution time
● Framework overhead – time spent in framework-specific execution
(rather than on the GPU)
Factors That Distinguish Deep Learning Frameworks
● Extensibility – supports custom kernels or distributed
computation implementations
● Customization – enhances efficiency in large-scale
datacenter settings
Production Applications and Inference Frameworks
Streamlining Inference
● Runtimes use serialized models, enable static execution
● No autograd or backward pass needed
● Training frameworks remain frontends
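The split between a training frontend and an inference runtime can be sketched in plain Python. This is only an illustration under stated assumptions: the JSON layout, the op names (`scale`, `shift`, `relu`), and `run_inference` are hypothetical and do not correspond to any real serialization format or runtime.

```python
import json

# "Training side": export frozen weights and a fixed list of ops.
exported = json.dumps({
    "ops": [
        {"type": "scale", "weight": 2.0},
        {"type": "shift", "bias": 1.0},
        {"type": "relu"},
    ]
})

# "Runtime side": forward kernels only, no autograd or backward pass.
KERNELS = {
    "scale": lambda x, op: [v * op["weight"] for v in x],
    "shift": lambda x, op: [v + op["bias"] for v in x],
    "relu": lambda x, op: [max(0.0, v) for v in x],
}

def run_inference(serialized, x):
    """Load the serialized model and execute its ops in their fixed order."""
    model = json.loads(serialized)
    for op in model["ops"]:
        x = KERNELS[op["type"]](x, op)
    return x

result = run_inference(exported, [-2.0, 0.5])
```

Because the op list is fixed at export time, a runtime is free to plan execution ahead of time, which is the sense in which inference is "static".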
Review
● Deep learning framework design
● Factors distinguishing performance
○ computation model
○ frontend language
○ framework overhead
○ customization
○ extensibility
Anatomy of a Framework
Fundamental Components of the Training Pipeline
[Figure: training pipeline. Labels: Input batch, Weights, Forward, Activations, Loss, Backward, Gradients, Optimizer step]
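The pipeline stages named on this slide (forward, loss, backward, optimizer step) can be sketched on a one-parameter model. This is a minimal illustration, not a framework API: the function names and the learning rate are assumptions chosen for the example, and the gradient is derived by hand rather than by autograd.

```python
# One-parameter linear model y = w * x, trained with mean squared error.

def forward(w, xs):
    """Forward pass: compute activations (predictions) from the weight."""
    return [w * x for x in xs]

def mse_loss(preds, ys):
    """Loss: mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def backward(preds, xs, ys):
    """Backward pass: gradient of the MSE loss with respect to w."""
    return sum(2 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / len(ys)

def sgd_step(w, grad, lr=0.1):
    """Optimizer step: move the weight against the gradient."""
    return w - lr * grad

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # input batch; targets follow y = 2x
w = 0.0
losses = []
for _ in range(50):
    preds = forward(w, xs)
    losses.append(mse_loss(preds, ys))
    grad = backward(preds, xs, ys)
    w = sgd_step(w, grad)
```

After the loop, `w` should sit very close to 2 and the recorded loss should have dropped from its initial value.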
Accelerating Tensors in Deep Learning
Frameworks
Accelerated tensors
● Support various floating-point precisions based on
hardware
● Use optimized primitives for tensor operations
when available
● Real-world GPU tensor ops already in action
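To make the precision point concrete, the standard library can round a value to half and single precision. This is a small sketch, not how a framework stores tensors; `round_to` is a hypothetical helper, and it relies on the `struct` format codes `"e"` (IEEE 754 half) and `"f"` (IEEE 754 single).

```python
import struct

def round_to(fmt, x):
    """Round a Python float (64-bit) to the format named by a struct code."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

x = 0.1
fp16 = round_to("e", x)   # IEEE 754 half precision
fp32 = round_to("f", x)   # IEEE 754 single precision
err16 = abs(fp16 - x)     # rounding error at half precision
err32 = abs(fp32 - x)     # rounding error at single precision
```

Lower precision halves or quarters the memory per element, at the cost of a larger rounding error, which is why the choice depends on what the hardware and the model tolerate.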
Automatic Differentiation in Deep Learning
Frameworks
Automatic differentiation
● Wraps tensor operations for derivative computation
● Records operations to a computation graph – just like
we’ve implemented!
● Can compute higher-order (e.g., second) derivatives, though
these are less common in deep learning
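A minimal sketch of the recording idea, assuming scalar values and only two operations. The `Value` class is illustrative and is not the course implementation or any framework's API; each operation stores its inputs and local derivative rules, and `backward` walks the recorded graph in reverse.

```python
class Value:
    """Scalar that records the operations applied to it (illustrative name)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # inputs that produced this value
        self.grad_fns = grad_fns    # local derivative rule for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Order the recorded graph so parents come before their results,
        # then apply the chain rule from the output back to the leaves.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, fn in zip(v.parents, v.grad_fns):
                parent.grad += fn(v.grad)

x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b      # forward pass records two operations
y.backward()       # dy/dw = x, dy/dx = w, dy/db = 1
```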
Device Runtimes in Deep Learning Frameworks
Device Runtimes
● Manage computation on devices
(CPUs, accelerators)
● Support for multiple accelerators on a single
host
● APIs for manipulating computation on
accelerators
● Data movement between GPU and CPU
Distributed Computation Primitives
● Distributed computation supports moving data across devices
● Collective communication primitives wrapped in APIs that
operate on tensors
● Data parallelism – automatic gradient synchronization with
AllReduce after wrapping a model and calling backward
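The gradient synchronization step can be simulated on one machine. This sketch only models the result of an averaging AllReduce (every worker ends with the same mean gradient); `all_reduce_mean` is a hypothetical name, and real collectives run over an interconnect rather than a Python list.

```python
def all_reduce_mean(per_worker_grads):
    """Simulated AllReduce: every worker receives the element-wise mean."""
    n = len(per_worker_grads)
    mean = [sum(vals) / n for vals in zip(*per_worker_grads)]
    return [list(mean) for _ in range(n)]

# Each worker computed gradients on its own shard of the batch.
local_grads = [
    [1.0, 2.0],   # worker 0
    [3.0, 4.0],   # worker 1
    [5.0, 6.0],   # worker 2
]
synced = all_reduce_mean(local_grads)
```

With identical gradients on every worker, each replica applies the same optimizer update and the model copies stay in sync.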
Model Parallelism
● Model parallelism – shard a model based on
user-defined parameters (example: layers per
GPU) or automated heuristics
● Advanced: distributed compilers for
determining sharding
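A user-defined sharding rule such as "layers per GPU" reduces to a small placement function. The sketch below assumes a purely sequential model; `shard_layers`, the layer names, and the `gpu:N` device strings are hypothetical.

```python
def shard_layers(layer_names, layers_per_gpu):
    """Assign consecutive layers to devices, layers_per_gpu at a time."""
    if layers_per_gpu < 1:
        raise ValueError("layers_per_gpu must be at least 1")
    placement = {}
    for i, name in enumerate(layer_names):
        placement[name] = f"gpu:{i // layers_per_gpu}"
    return placement

layers = ["embed", "block0", "block1", "block2", "head"]
placement = shard_layers(layers, layers_per_gpu=2)
```

An automated heuristic or a distributed compiler would replace the fixed `layers_per_gpu` rule with a decision based on memory and communication cost.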
Data Abstractions
● Utilities for loading, preprocessing, and iterating
over samples
● Asynchronous execution to move samples from
CPU to GPU
● Parallelized/threaded data loading to avoid
bottlenecks in execution
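Threaded loading can be sketched with a bounded queue: a background thread preprocesses samples ahead of the consumer. This is a simplified illustration under stated assumptions; `prefetching_loader` and its parameters are hypothetical, and the CPU-to-GPU copy is not modeled.

```python
import queue
import threading

def prefetching_loader(samples, preprocess, buffer_size=4):
    """Yield preprocessed samples while a background thread loads ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()                      # sentinel marking the end of the data

    def worker():
        for s in samples:
            buf.put(preprocess(s))       # blocks when the buffer is full
        buf.put(done)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item

batches = list(prefetching_loader(range(5), preprocess=lambda s: s * 2))
```

The bounded buffer keeps the loader a few samples ahead without letting it consume unbounded memory.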
Neural Module Abstraction
Module Abstractions
● Encapsulate tensor operations into building blocks
● Include convolutions, linear layers, transformers,
activations
● Built using functional tensor operations
● Forward pass for inference, autograd for backward
Implementing a Neural Module
● Inherit from a module interface
● Define any state and parameters at module
construction
● Implement the forward function for inference
● Autograd automatically computes parameter
gradients; the optimizer then applies the updates
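These steps can be sketched with a hypothetical module interface written in plain Python. `Module`, `Parameter`, `Linear`, and `MLP` below are illustrative stand-ins, not a real framework's classes, and gradient tracking is left out to keep the example short.

```python
class Module:
    """Hypothetical minimal module interface (not a real framework class)."""

    def parameters(self):
        # Collect parameters from this module and any submodules.
        params = []
        for value in vars(self).values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):
                params.extend(value.parameters())
        return params

    def __call__(self, x):
        return self.forward(x)

class Parameter:
    """Wrapper marking a value as trainable state."""
    def __init__(self, data):
        self.data = data

class Linear(Module):
    """y = W x + b on plain Python lists."""
    def __init__(self, in_features, out_features):
        self.weight = Parameter([[0.5] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight.data, self.bias.data)]

class MLP(Module):
    """Two linear layers with a ReLU in between."""
    def __init__(self):
        self.fc1 = Linear(2, 3)
        self.fc2 = Linear(3, 1)

    def forward(self, x):
        hidden = [max(0.0, h) for h in self.fc1(x)]   # ReLU
        return self.fc2(hidden)

model = MLP()
out = model([1.0, 2.0])
```

The user writes only the constructor and `forward`; discovering parameters for the optimizer is handled by the base class.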
Review
● Deep learning framework components
○ tensors
○ autograd
○ device runtimes
○ distributed computation
○ datasets
○ modules
Computation Models
What is a Computation Model?
● How do we enqueue computation?
○ How large are kernels? What are they?
● How do we wait on computation?
○ Do we block the host thread? When do we block?
● How much information do we want before launching
computation?
○ What optimizations should we perform?
● Computation model – approach to launch, manage, and wait for
GPU computation
Eager Execution Computation Model
From https://arxiv.org/abs/2201.12465
CPU-GPU Synchronization in Eager Execution
From https://arxiv.org/abs/2201.12465
Benefits of Eager Computation Model
● Flexibility: Supports arbitrary tensor programs,
including those with control flow and dynamism
● Debuggability: Intermediate results are always
available for inspection during
non-synchronization periods
● Simplicity: Individual operators are executed
atomically with no side effects
Inefficiencies in Eager Execution
● CPU-GPU idle time: CPU is idle while GPU is active, leading
to wasted CPU time
● Poor overlap between CPU and GPU computation, slowing
overall program progression
● Kernel launch overhead: Fixed costs for each kernel launch
can be significant, especially for small kernels
Performance vs Benefits of Eager Execution
Eager Execution Trade-offs
● Slower than other computation models
● GPU gains expose CPU-GPU inefficiencies, kernel
launch overhead
● Strengths: Easy debugging, intuitive user experience
● Enables intermediate result inspection, flexible
program expression
Deferred Execution Computation Model
● Deferred Execution: collect operations in a queue/graph,
launch together
● Operator Fusion: Combines ops for efficiency (e.g., t+3+5
→ t+8)
● Kernel Fusion: Merges kernels, improves memory reuse
and in-place ops
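The `t+3+5 → t+8` example can be sketched with a tensor that records operations instead of running them. `LazyTensor` is an illustrative class, not a framework type; it supports only adding constants, and each fused entry stands in for one kernel launch.

```python
class LazyTensor:
    """Records element-wise ops instead of running them (illustrative only)."""

    def __init__(self, data, pending=()):
        self.data = data
        self.pending = list(pending)     # queue of ("add", constant) ops

    def __add__(self, constant):
        return LazyTensor(self.data, self.pending + [("add", constant)])

    def fused_ops(self):
        # Operator fusion: collapse runs of adds into a single add.
        fused = []
        for name, c in self.pending:
            if fused and fused[-1][0] == name == "add":
                fused[-1] = ("add", fused[-1][1] + c)
            else:
                fused.append((name, c))
        return fused

    def materialize(self):
        # Launch the fused queue; each entry stands in for one kernel.
        out = self.data
        for _, c in self.fused_ops():
            out = [v + c for v in out]
        return out

t = LazyTensor([1, 2, 3])
u = t + 3 + 5                 # nothing is computed yet
values = u.materialize()      # one fused add of 8 instead of two adds
```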
Dynamism in Deferred Execution
Maintaining Dynamism
● Operations can still be enqueued and executed
based on control flow
● CPU thread blocks until results are materialized, then
decision-making occurs based on outcomes
● Combines the benefits of eager execution with
performance improvements and reduced overhead
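The interaction between deferred execution and Python control flow can be sketched as follows. The `Deferred` class and the `sync_log` list are hypothetical; `item()` stands in for the point where the host thread blocks until queued work has produced a value. For simplicity the sketch re-runs the whole queue on each `item()` call rather than caching results.

```python
sync_log = []   # records the queue length at each point where the host blocks

class Deferred:
    """Scalar with a queue of pending ops (illustrative, not a framework API)."""

    def __init__(self, start, ops=()):
        self.start = start
        self.ops = list(ops)

    def __mul__(self, c):
        return Deferred(self.start, self.ops + [lambda v: v * c])

    def __sub__(self, c):
        return Deferred(self.start, self.ops + [lambda v: v - c])

    def item(self):
        # Host thread blocks here: run everything enqueued so far.
        value = self.start
        for op in self.ops:
            value = op(value)
        sync_log.append(len(self.ops))
        return value

x = Deferred(4.0)
y = x * 3 - 2                # two ops enqueued, nothing executed
if y.item() > 5:             # Python control flow forces materialization
    z = y * 2
else:
    z = y - 1
final = z.item()
```

Only the branch decision and the final read force synchronization; everything between them stays queued.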
CUDA Graphs for Combining Kernels
● CUDA Graphs: allow combining multiple kernels while retaining
discrete execution
From https://arxiv.org/abs/2201.12465
CUDA Graphs vs Eager Execution
● CUDA Graphs: A form of deferred execution that
buffers kernels
● Kernels are added to a computation graph as they are
received, then executed together
● Provides performance benefits similar to those of other
deferred execution approaches
● Maintains eager execution semantics with discrete
kernels for specific operations
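The capture-then-replay idea can be simulated without a GPU. This sketch does not use the CUDA Graphs API; `KernelGraph` and `eager_run` are hypothetical, kernels are plain Python functions, and a "launch" is just a counter increment.

```python
class KernelGraph:
    """Captures a sequence of kernels once and replays it as one launch."""

    def __init__(self):
        self.kernels = []
        self.launches = 0

    def capture(self, kernel):
        self.kernels.append(kernel)      # kernels stay discrete inside the graph

    def replay(self, x):
        self.launches += 1               # one launch for the whole graph
        for kernel in self.kernels:
            x = kernel(x)
        return x

def eager_run(kernels, x):
    """Eager baseline: every kernel pays its own launch."""
    launches = 0
    for kernel in kernels:
        launches += 1
        x = kernel(x)
    return x, launches

kernels = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
graph = KernelGraph()
for k in kernels:
    graph.capture(k)

eager_out, eager_launches = eager_run(kernels, 5)
graph_out = graph.replay(5)
```

Both paths compute the same result; the difference is three launches in the eager path versus one for the replayed graph.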
Static Execution Computation Model
● Static Execution: An extended form of deferred
execution, where the user decides how to organize
and launch computation
● Declarative Programming Style: Entire program state,
including control flow, must be explicitly defined
From https://arxiv.org/abs/2201.12465
Constructing Static Computation Graphs
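# Assumes TensorFlow 2.x with the TF1 compatibility API available
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()  # build a static graph instead of running eagerly

# Placeholder for a batch of 10-dimensional inputs; values are fed at run time
x = tf.placeholder(tf.float32, [None, 10])
h = tf.matmul(x, tf.Variable(tf.zeros([10, 5])))
# Framework-specific if (not Python if)
activation = tf.cond(
    tf.greater(tf.reduce_mean(h), 0),
    lambda: tf.nn.relu(h),
    lambda: tf.nn.tanh(h)
)
# Framework-specific while (not Python while)
_, result = tf.while_loop(
    lambda i, acc: tf.less(i, 3),
    lambda i, acc: [i + 1, acc + h],
    [tf.constant(0), tf.zeros_like(h)]
)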
Optimization Opportunities in Static Execution
● Full program specification allows for advanced
optimization opportunities
● Enables optimization in scheduling, memory
usage, and operation fusion
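Two optimizations that a full program specification makes possible are constant folding and removal of unused nodes. The sketch below applies both to a toy graph; the dictionary layout, node names, and function names are assumptions made for illustration and do not mirror any compiler's representation.

```python
# Graph: node name -> (op, inputs or constant value). "x" is the runtime input.
graph = {
    "x": ("input", []),
    "two": ("const", 2.0),
    "three": ("const", 3.0),
    "six": ("mul", ["two", "three"]),     # depends only on constants
    "y": ("mul", ["x", "six"]),
    "unused": ("add", ["x", "two"]),      # never reaches the output
}

def fold_constants(graph):
    """Replace ops whose inputs are all constants with a single constant."""
    folded = dict(graph)
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op in ("add", "mul") and all(folded[a][0] == "const" for a in args):
                a, b = (folded[n][1] for n in args)
                folded[name] = ("const", a + b if op == "add" else a * b)
                changed = True
    return folded

def prune_dead_nodes(graph, output):
    """Keep only the nodes that the output depends on."""
    live, stack = set(), [output]
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            op, args = graph[name]
            if op in ("add", "mul"):
                stack.extend(args)
    return {n: graph[n] for n in graph if n in live}

optimized = prune_dead_nodes(fold_constants(graph), "y")
```

Neither pass is safe in a purely eager model, because the framework cannot know ahead of time that `unused` is never read.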
Review
● Computation models in deep learning frameworks
● Performance characteristics and trade-offs of each
model
● Programming models including eager, deferred, and
static execution
Computation Models: Framework Case Study
Comparison of Computation Models
From https://arxiv.org/abs/2201.12465
Dynamism vs Optimization in Computation Models
[Figure with a "Dynamism" axis. From https://arxiv.org/abs/2201.12465]
Frameworks and Computation Models
● PyTorch:
○ Initially featured eager execution
○ Introduced CUDA Graphs in PyTorch 1.x to reduce
overhead
○ PyTorch 2.0: Introduced torch.compile, combining
deferred and static execution with optimizations and
dynamic support
TensorFlow and Computation Models
● TensorFlow:
○ Initially featured static execution with explicit graph
construction
○ Control flow (e.g., if statements, loops) implemented via
specific operators
○ Evolved to include deferred and static execution modes
with XLA compiler
○ Deferred/static modes improve performance, especially for
inference without further optimization
JAX and Computation Models
● JAX:
○ Built on top of XLA from the beginning
○ Features both deferred and static execution
modes
○ Maintains dynamism with minimal
abstractions beyond standard Python for
model definition
Evolution of Dynamic Computation Models
Dynamic Computation Models:
● Emerged to meet deep learning research needs
● Preferred for imperative, intuitive programming
● Evolved towards deferred execution for efficiency
● Buffers operations while allowing debugging and
control flow
Review
● Computation models in today’s deep learning
frameworks
● Explored trade-offs between models and usability
● Dynamic, deferred, and static execution impact
performance and programming style
Performance
Deep Learning Framework Performance
● Language-level overhead
● Kernel launch overhead
● Kernel and compiler quality
● Computation Model
Language-Level Overhead and Performance
● Frontend bottlenecks: GPU execution and C++
internals are faster than frontend languages
(typically Python)
● Overhead Issues: Language-level overhead can
prevent the CPU from dispatching operations
quickly enough to keep up with GPU execution
● Idle GPU: The GPU may be idle while the CPU
executes tensor programs and launches kernels
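The effect of slow dispatch on GPU idle time can be estimated with a simple timeline model. This is a sketch under strong assumptions: one host thread, one GPU stream, fully asynchronous launches, and constant per-op costs. The function name and the microsecond figures are made up for illustration, not measured.

```python
def gpu_idle_time(num_ops, dispatch_us, kernel_us):
    """Simulate one CPU thread feeding one GPU stream asynchronously.

    dispatch_us: host time to issue each op (language + framework overhead)
    kernel_us:   GPU time to execute each kernel
    Returns (total_time_us, gpu_idle_us).
    """
    cpu_time = 0.0      # when the host finishes issuing the current op
    gpu_free = 0.0      # when the GPU finishes its current kernel
    idle = 0.0
    for _ in range(num_ops):
        cpu_time += dispatch_us
        start = max(cpu_time, gpu_free)   # kernel must be issued and GPU free
        idle += start - gpu_free
        gpu_free = start + kernel_us
    return gpu_free, idle

# Host-bound: dispatch (10 us) is slower than the kernel (2 us).
slow_total, slow_idle = gpu_idle_time(100, dispatch_us=10, kernel_us=2)
# GPU-bound: kernels (50 us) hide the dispatch cost after the first op.
fast_total, fast_idle = gpu_idle_time(100, dispatch_us=10, kernel_us=50)
```

In the host-bound case the GPU is idle for most of the run, while in the GPU-bound case it waits only for the first dispatch.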
Kernel-Launch Overhead and Performance
● Fixed Overhead: Kernel launch overhead impacts
small kernels
● Large kernels amortize launch costs, improving
efficiency
● Deferred execution (e.g., CUDA Graphs) minimizes
overhead
● Optimization: Fewer, larger kernels enhance
performance
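The amortization argument is simple arithmetic: a fixed amount of work split across more kernels pays the launch cost more often. The sketch below assumes a fixed per-launch cost of 5 microseconds, which is an illustrative figure rather than a measurement; both function names are hypothetical.

```python
def total_time_us(num_kernels, work_us, launch_us=5.0):
    """Total time when a fixed amount of GPU work is split across kernels.

    launch_us is an assumed fixed per-launch cost, not a measured value.
    """
    return num_kernels * launch_us + work_us

def overhead_fraction(num_kernels, work_us, launch_us=5.0):
    """Share of total time spent on launch overhead rather than compute."""
    return num_kernels * launch_us / total_time_us(num_kernels, work_us, launch_us)

work = 1000.0                               # microseconds of actual compute
many_small = overhead_fraction(1000, work)  # 1000 kernels of 1 us each
few_large = overhead_fraction(10, work)     # 10 kernels of 100 us each
```

Under these assumptions launch overhead accounts for about five sixths of the run with many small kernels and under five percent with a few large ones.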
Kernel/Compiler Quality
High-Quality Kernels:
● Optimized GPU kernels or generated code boost speed
● Faster individual operators improve efficiency
● Compilers optimize memory, fuse ops, and apply global
improvements
● Significant performance gains through compiler
optimizations
Impact of Computation Model on Performance
● Deferred and Static Models: Higher performance
through non-blocking CPU/host threads, optimized
kernels, and batched kernel launches
● Idle GPU Time: Minimizing idle GPU time is a predictor of
overall performance
● GPU Utilization: While correlated with performance, GPU
utilization alone doesn’t fully predict framework
performance
Evolving Frameworks to Overcome Bottlenecks
● GPU Speed vs. Framework Bottlenecks: As GPUs
improve, non-GPU-related overhead becomes more
significant
● Adapting Computation Models: Frameworks evolve
to reduce overhead from non-GPU components
● Python Adaptations:
○ No-GIL: Efforts to remove the Global Interpreter Lock
(GIL) for better multi-threading
○ JIT Compilation: Just-In-Time (JIT) compilation for
performance
Advancements in Compiler Technologies
● Distributed Computation: Improved compiler
technologies for better distribution of computation
● Memory Usage Models: Advanced memory usage
models enable efficient operator ordering and code
generation
● Impact on Performance: Enhances framework
performance on both single GPUs and at scale
Review
● Overhead Types: Language-level, kernel-launch, and GPU
execution overhead
● Computation Models: Deferred and static models can
improve performance
● Framework Adaptation: Efforts to reduce overhead and
improve efficiency as GPUs evolve