569 - 10 - Deep Learning Frameworks

This document discusses the importance of deep learning frameworks for GPU computing, highlighting their roles in tensor operations, hardware acceleration, and flexibility across platforms. It covers various computation models such as eager, deferred, and static execution, detailing their performance characteristics and trade-offs. Additionally, it reviews popular frameworks like PyTorch, TensorFlow, and Jax, focusing on how they manage computation and optimize performance.


GPU Computing for Machine Learning Systems

Deep Learning
Frameworks

Introduction

Jacob Kahn

This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks

Image made with generative AI


Why Build Frameworks for Deep Learning?
● Tensors – provide operations and handle memory
management
● Hardware acceleration – fast, optimized
implementations
● Hardware agnosticism – run the same program on
multiple platforms (GPU, CPU)
Operating Modes for Deep Learning Frameworks
● Training – adjusts model weights using gradients
from backpropagation
○ Compute-intensive
○ Distributed computation
● Inference – frozen model weights with data flowing
in a single forward pass
○ Static, more optimization
Popular Frameworks
● PyTorch: Dynamic-first, flexible, researcher-friendly
● TensorFlow: Mixes dynamic and declarative, Keras
integration
● Jax: XLA-based, efficient autograd, optimized for
distribution
Factors That Distinguish Deep Learning Frameworks
● Computation Model
○ Defines how tensor programs are executed
○ Influences how models are expressed by implementers
○ Varies based on user goals (researchers, practitioners,
or downstream users)
Factors That Distinguish Deep Learning Frameworks
Frontend Language Evolution
● Python replaced fragmented frontends like Lua and C++
● Simplifies implementation across frameworks
● Optimizations for HPC: reduced overhead, better parallelism
Factors That Distinguish Deep Learning Frameworks
Performance
● GPU computation
dominates execution time
● Framework overhead:
time spent in framework-specific execution
(rather than on the GPU)
Factors That Distinguish Deep Learning Frameworks
● Extensibility – supports custom kernels or distributed
computation implementations
● Customization – enhances efficiency in large-scale
datacenter settings
Production Applications and Inference Frameworks
Streamlining Inference
● Runtimes use serialized models, enable static execution
● No autograd or backward pass needed
● Training frameworks remain frontends
Review
● Deep learning framework design
● Factors distinguishing performance
○ computation model
○ frontend language
○ framework overhead
○ customization
○ extensibility
GPU Computing for Machine Learning Systems

Deep Learning
Frameworks

Anatomy of a
Framework

Jacob Kahn

This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks


Image made with generative AI
Fundamental Components of the Training Pipeline

Optimizer step

Forward Backward
Weights Activations Loss Gradients

Input batch
Accelerating Tensors in Deep Learning
Frameworks
Accelerated tensors
● Support various floating-point precisions based on
hardware
● Use optimized primitives for tensor operations
when available
● Real-world GPU tensor ops already in action
Automatic Differentiation in Deep Learning
Frameworks
Automatic differentiation
● Wraps tensor operations for derivative computation
● Records operations to a computation graph – just like
we’ve implemented!
● Can compute higher-order (e.g., second) derivatives, though
these are less common in deep learning
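As a rough illustration of the "wrap operations and record a graph" idea, here is a minimal scalar sketch in plain Python. The `Value` class and its method names are hypothetical and are not the API of any framework named in this lecture; real frameworks do the same bookkeeping over tensors.

```python
class Value:
    """Scalar that records the operations producing it (hypothetical name)."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents          # edges of the recorded computation graph
        self.backward_fn = backward_fn  # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the recorded graph, then apply the chain rule
        # from the output back toward the inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = x * w + b   # the forward pass records one multiply and one add
y.backward()    # dy/dx = w, dy/dw = x, dy/db = 1
```

After `y.backward()`, each input's `grad` field holds the derivative of `y` with respect to that input.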
Device Runtimes in Deep Learning Frameworks
Device Runtimes
● Manage computation on devices
(CPUs, accelerators)
● Support for multiple accelerators on a single
host
● APIs for manipulating computation on
accelerators
● Data movement between GPU and CPU
Distributed Computation Primitives
● Distributed computation primitives support moving data across devices
● Collective communication primitives wrapped in APIs that
operate on tensors
● Data parallelism – automatic gradient synchronization with
AllReduce after wrapping a model and calling backward
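The gradient synchronization in data parallelism can be sketched without any real communication library. The example below simulates two workers in a single process. `all_reduce_mean` and `sgd_step` are hypothetical helpers, and averaging (rather than summing) the gradients is an assumption made for this sketch.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients elementwise and hand every worker the same result."""
    num_workers = len(per_worker_grads)
    summed = [sum(values) for values in zip(*per_worker_grads)]
    mean = [s / num_workers for s in summed]
    return [list(mean) for _ in range(num_workers)]


def sgd_step(weights, grads, lr):
    """Plain gradient-descent update on a list of weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two simulated workers hold identical model replicas but computed
# gradients on different shards of the batch.
weights = [[1.0, 2.0], [1.0, 2.0]]
local_grads = [[0.5, 1.0], [1.5, 3.0]]

synced = all_reduce_mean(local_grads)
weights = [sgd_step(w, g, lr=0.5) for w, g in zip(weights, synced)]
```

Because every worker applies the same averaged gradient, the replicas remain identical after the update, which is the property data parallelism relies on.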
Distributed Computation Primitives
● Model parallelism – shard a model based on
user-defined parameters (example: layers per
GPU) or automated heuristics
● Advanced: distributed compilers for
determining sharding
Data Abstractions
● Utilities for loading, preprocessing, and iterating
over samples
● Asynchronous execution to move samples from
CPU to GPU
● Parallelized/threaded data loading to avoid
bottlenecks in execution
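One way to picture threaded data loading is a background producer that fills a bounded queue while the consumer iterates. This is a minimal sketch using only the Python standard library. `prefetching_loader` is a hypothetical name, and the "preprocessing" is a stand-in multiplication rather than real decoding or augmentation.

```python
import queue
import threading


def prefetching_loader(samples, batch_size, prefetch=2):
    """Yield batches while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=prefetch)  # bounded, so the producer cannot run far ahead
    done = object()                         # sentinel marking the end of the data

    def producer():
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            batch = [float(s) * 2.0 for s in chunk]  # stand-in preprocessing
            buffer.put(batch)                        # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()                 # blocks until a batch is ready
        if item is done:
            return
        yield item


batches = list(prefetching_loader(list(range(5)), batch_size=2))
```

The bounded queue is the design choice worth noting: it lets loading overlap with consumption while capping how much memory the prefetched batches can occupy.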
Neural Module Abstraction
Module Abstractions
● Encapsulate tensor operations into building blocks
● Include convolutions, linear layers, transformers,
activations
● Built using functional tensor operations
● Forward pass for inference, autograd for backward
Implementing a Neural Module
● Inherit from a module interface
● Define any state and parameters for the module
construction
● Implement the forward function for inference
● Autograd automatically computes parameter
gradients; the optimizer then applies the updates
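The steps above can be sketched in plain Python. `Module`, `Parameter`, `Linear`, `ReLU`, and `MLP` below are hypothetical stand-ins written for this example, not the classes of any particular framework. Gradients are omitted to keep the focus on state, parameters, and the forward function.

```python
class Parameter:
    """Marks a value as trainable state owned by a module."""

    def __init__(self, values):
        self.values = values


class Module:
    """Hypothetical minimal module interface."""

    def parameters(self):
        # Collect parameters from this module and any nested modules.
        params = []
        for value in vars(self).values():
            if isinstance(value, Parameter):
                params.append(value)
            elif isinstance(value, Module):
                params.extend(value.parameters())
        return params

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, x):
        return self.forward(x)


class Linear(Module):
    def __init__(self, in_features, out_features):
        # Constant initialization keeps the example deterministic.
        self.weight = Parameter([[0.5] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)

    def forward(self, x):
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight.values, self.bias.values)
        ]


class ReLU(Module):
    def forward(self, x):
        return [max(0.0, v) for v in x]


class MLP(Module):
    def __init__(self):
        self.fc1 = Linear(3, 2)
        self.act = ReLU()
        self.fc2 = Linear(2, 1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


model = MLP()
out = model([1.0, 2.0, 3.0])
num_params = len(model.parameters())
```

Composite modules such as `MLP` only define construction and `forward`; parameter discovery comes from the shared interface.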
Review
● Deep learning framework components
○ tensors
○ autograd
○ device runtimes
○ distributed computation
○ datasets
○ modules
GPU Computing for Machine Learning Systems

Deep Learning
Frameworks

Computation
Models

Jacob Kahn

This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks


Image made with generative AI
What is a Computation Model?
● How do we enqueue computation?
○ How large are kernels? What are they?
● How do we wait on computation?
○ Do we block the host thread? When do we block?
● How much information do we want before launching
computation?
○ What optimizations should we perform?
● Computation model – approach to launch, manage, and wait for
GPU computation
Eager Execution Computation Model

From https://arxiv.org/abs/2201.12465
CPU-GPU Synchronization in Eager Execution

From https://arxiv.org/abs/2201.12465
Benefits of Eager Computation Model
● Flexibility: Supports arbitrary tensor programs,
including those with control flow and dynamism
● Debuggability: Intermediate results are always
available for inspection during
non-synchronization periods
● Simplicity: Individual operators are executed
atomically with no side effects
Inefficiencies in Eager Execution
● CPU-GPU idle time: CPU is idle while GPU is active, leading
to wasted CPU time
● Poor overlap between CPU and GPU computation, slowing
overall program progression
● Kernel launch overhead: Fixed costs for each kernel launch
can be significant, especially for small kernels
Performance vs Benefits of Eager Execution
Eager Execution Trade-offs
● Slower than other computation models
● GPU gains expose CPU-GPU inefficiencies, kernel
launch overhead
● Strengths: Easy debugging, intuitive user experience
● Enables intermediate result inspection, flexible
program expression
Deferred Execution Computation Model
● Deferred Execution: collect operations in a queue/graph,
launch together
● Operator Fusion: Combines ops for efficiency (e.g., t+3+5
→ t+8)
● Kernel Fusion: Merges kernels, improves memory reuse
and in-place ops
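The `t+3+5 → t+8` example can be made concrete with a small deferred-execution sketch. `LazyTensor` is a hypothetical class that only supports adding integer constants. It queues operations and fuses them when the result is requested, and the "launch" count is a stand-in for kernel launches.

```python
class LazyTensor:
    """Hypothetical tensor that queues constant adds instead of running them."""

    def __init__(self, values, pending=()):
        self.values = values
        self.pending = tuple(pending)  # deferred "+ constant" operations

    def __add__(self, constant):
        # Nothing is computed here; the operation is only recorded.
        return LazyTensor(self.values, self.pending + (constant,))

    def materialize(self):
        # Operator fusion: collapse every queued add into a single constant,
        # so one pass over the data replaces one pass per operation.
        fused = sum(self.pending)
        launches = 1 if self.pending else 0
        return [v + fused for v in self.values], launches


t = LazyTensor([1, 2, 3])
u = t + 3 + 5                       # two operations recorded, none executed
result, launches = u.materialize()  # executed as a single fused t + 8
```

Integers are used deliberately: with floating-point values, combining the constants first can change rounding, so a real compiler has to decide whether that reassociation is acceptable.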
Dynamism in Deferred Execution
Maintaining Dynamism
● Operations can still be enqueued and executed
based on control flow
● CPU thread blocks until results are materialized, then
decision-making occurs based on outcomes
● Combines the benefits of eager execution with
performance improvements and reduced overhead
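A small sketch of how ordinary Python control flow interacts with a deferred queue: the branch needs a concrete value, so the queued work must be flushed at that point. `Deferred` is a hypothetical class. For simplicity it re-runs its queued operations on every `item()` call, whereas a real framework would cache materialized results.

```python
class Deferred:
    """Hypothetical value whose operations run only when a result is needed."""

    def __init__(self, value, ops=()):
        self.value = value
        self.ops = tuple(ops)

    def apply(self, fn):
        # Enqueue the operation without executing it.
        return Deferred(self.value, self.ops + (fn,))

    def item(self):
        # Blocking point: run everything queued so far and return the result.
        out = self.value
        for fn in self.ops:
            out = fn(out)
        return out


x = Deferred(-4.0).apply(lambda v: v * 2.0).apply(lambda v: v + 1.0)

# A Python `if` needs a concrete number, so the queue is materialized here
# and the host decides which operations to enqueue next.
if x.item() > 0:
    y = x
else:
    y = x.apply(abs)

result = y.item()
```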
CUDA Graphs for Combining Kernels

● CUDA Graphs: Allow
combining multiple kernels
while retaining discrete
execution

From https://arxiv.org/abs/2201.12465
CUDA Graphs vs Eager Execution
● CUDA Graphs: A form of deferred execution that
buffers kernels
● Kernels are added to a computation graph as they are
received, then executed together
● Provides performance benefits similar to those of deferred
execution
● Maintains eager execution semantics with discrete
kernels for specific operations
Static Execution Computation Model
● Static Execution: An extended form of deferred
execution, where the user decides how to organize
and launch computation
● Declarative Programming Style: Entire program state,
including control flow, must be explicitly defined

From https://arxiv.org/abs/2201.12465
Constructing Static Computation Graphs
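# Assumes TensorFlow 2.x, where the TF1 graph API lives under compat.v1;
# on TensorFlow 1.x, `import tensorflow as tf` is enough.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # build a static graph instead of running eagerly

# Declare the input and ops; nothing executes until the graph is run
x = tf.placeholder(tf.float32, [None, 10])
h = tf.matmul(x, tf.Variable(tf.zeros([10, 5])))

# Framework-specific if (not Python if)
activation = tf.cond(
    tf.greater(tf.reduce_mean(h), 0),
    lambda: tf.nn.relu(h),
    lambda: tf.nn.tanh(h)
)

# Framework-specific while (not Python while)
_, result = tf.while_loop(
    lambda i, acc: tf.less(i, 3),
    lambda i, acc: [i + 1, acc + h],
    [tf.constant(0), tf.zeros_like(h)]
)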
Optimization Opportunities in Static Execution
● Full program specification allows for advanced
optimization opportunities
● Enables optimization in scheduling, memory
usage, and operation fusion
Review
● Computation models in deep learning frameworks
● Performance characteristics and trade-offs of each
model
● Programming models including eager, deferred, and
static execution
GPU Computing for Machine Learning Systems

Deep Learning
Frameworks

Computation
Models:
Framework
Case Study

Jacob Kahn

This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks


Image made with generative AI
Comparison of Computation Models

From https://arxiv.org/abs/2201.12465
Dynamism vs Optimization in Computation Models
Dynamism

From https://arxiv.org/abs/2201.12465
Frameworks and Computation Models
● PyTorch:
○ Initially featured eager execution
○ Introduced CUDA Graphs in PyTorch 1.x to reduce
overhead
○ PyTorch 2.0: Introduced torch.compile, combining
deferred and static execution with optimizations and
dynamic support
TensorFlow and Computation Models
● TensorFlow:
○ Initially featured static execution with explicit graph
construction
○ Control flow (e.g., if statements, loops) implemented via
specific operators
○ Evolved to include deferred and static execution modes
with XLA compiler
○ Deferred/static modes improve performance, especially for
inference without further optimization
Jax and Computation Models
● Jax:
○ Built on top of XLA from the beginning
○ Features both deferred and static execution
modes
○ Maintains dynamism with minimal
abstractions beyond standard Python for
model definition
Evolution of Dynamic Computation Models
Dynamic Computation Models:
● Emerged to meet deep learning research needs
● Preferred for imperative, intuitive programming
● Evolved towards deferred execution for efficiency
● Buffers operations while allowing debugging and
control flow
Review
● Computation models in today’s deep learning
frameworks
● Explored trade-offs between models and usability
● Dynamic, deferred, and static execution impact
performance and programming style
GPU Computing for Machine Learning Systems

Deep Learning
Frameworks

Performance

Jacob Kahn

This lecture is adapted from https://jacobkahn.me/writing/post/ml_systems_frameworks


Image made with generative AI
Deep Learning Framework Performance
● Language-level overhead
● Kernel launch overhead
● Kernel and compiler quality
● Computation Model
Language-Level Overhead and Performance
● Frontend bottlenecks: GPU execution and C++
internals are faster than frontend languages
(typically Python)
● Overhead Issues: Language-level overhead can
prevent the CPU from dispatching operations
quickly enough to keep up with GPU execution
● Idle GPU: The GPU may be idle while the CPU
executes tensor programs and launches kernels
Kernel-Launch Overhead and Performance
● Fixed Overhead: Kernel launch overhead impacts
small kernels
● Large kernels amortize launch costs, improving
efficiency
● Deferred execution (e.g., CUDA Graphs) minimizes
overhead
● Optimization: Fewer, larger kernels enhance
performance
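A back-of-the-envelope cost model shows why fewer, larger kernels help. The sketch assumes launches are serialized and not overlapped with GPU work. The 5 µs launch cost and 2000 µs of total work are made-up numbers for illustration, not measurements.

```python
def total_time_us(num_kernels, total_work_us, launch_us):
    """Time for a fixed amount of GPU work split across num_kernels launches.

    Assumes each launch pays a fixed, non-overlapped cost (a simplification).
    """
    return num_kernels * launch_us + total_work_us


def overhead_fraction(num_kernels, total_work_us, launch_us):
    """Share of total time spent on launch overhead rather than GPU work."""
    total = total_time_us(num_kernels, total_work_us, launch_us)
    return (num_kernels * launch_us) / total


# Hypothetical workload: the same 2000 us of GPU work, split two ways.
many_small = total_time_us(1000, 2000.0, 5.0)  # 1000 small kernels
few_large = total_time_us(10, 2000.0, 5.0)     # 10 large kernels

small_overhead = overhead_fraction(1000, 2000.0, 5.0)
large_overhead = overhead_fraction(10, 2000.0, 5.0)
```

Under these assumptions the many-small-kernels version spends most of its time on launches, while the few-large-kernels version spends only a few percent there.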
Kernel/Compiler Quality
High-Quality Kernels:
● Optimized GPU kernels or generated code boost speed
● Faster individual operators improve efficiency
● Compilers optimize memory, fuse ops, and apply global
improvements
● Significant performance gains through compiler
optimizations
Impact of Computation Model on Performance
● Deferred and Static Models: Higher performance
through non-blocking CPU/host threads, optimized
kernels, and batched kernel launches
● Idle GPU Time: Minimizing idle GPU time is a predictor of
overall performance
● GPU Utilization: While correlated with performance, GPU
utilization alone doesn’t fully predict framework
performance
Evolving Frameworks to Overcome Bottlenecks
● GPU Speed vs. Framework Bottlenecks: As GPUs
improve, non-GPU-related overhead becomes more
significant
● Adapting Computation Models: Frameworks evolve
to reduce overhead from non-GPU components
● Python Adaptations:
○ No-GIL: Efforts to remove the Global Interpreter Lock
(GIL) for better multi-threading
○ JIT Compilation: Just-In-Time (JIT) compilation for
performance
Advancements in Compiler Technologies
● Distributed Computation: Improved compiler
technologies for better distribution of computation
● Memory Usage Models: Advanced memory usage
models enable efficient operator ordering and code
generation
● Impact on Performance: Enhances framework
performance on both single GPUs and at scale
Review
● Overhead Types: Language-level, kernel-launch, and GPU
execution overhead
● Computation Models: Deferred and static models can
improve performance
● Framework Adaptation: Efforts to reduce overhead and
improve efficiency as GPUs evolve