Best Practices in Networking for AI
Sungta Tsai, Sr. Solution Architect | April 2024
Explosive Growth in AI Computational Requirements
[Chart: training compute (petaFLOPs, log scale) for landmark models from 2012 to 2024, spanning AlexNet through GPT-MoE-1.8T. Before transformers, training compute grew roughly 8x every 2 years; since transformers, roughly 256x every 2 years.]
The Data Center is The Computer
The network defines the data center
[Diagram: deployment types positioned by cluster size (10 to 1M+ GPUs) and the fabric serving each: Traditional Ethernet, NVIDIA Spectrum-X AI Ethernet Fabric, NVLink + InfiniBand.]

Cloud (Traditional Ethernet)
• Multi-tenant
• Variety of small-scale workloads
• Traditional ethernet network can suffice

AI Cloud / Generative AI Cloud (NVIDIA Spectrum-X AI Ethernet Fabric)
• Multi-tenant
• Variety of workloads, including larger-scale generative AI
• Traditional ethernet network for North-South traffic
• Spectrum-X ethernet for the AI fabric (East-West)

AI Factory (NVLink + InfiniBand)
• Single or few users
• Extremely large AI models
• NVLink and InfiniBand are the gold standard for the AI fabric
LLM Compute and Communication Profiling
Representative profile from a large-scale LLM training run.
Communication is bursty by nature; average bandwidth utilization is not a good criterion for sizing the network.
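To make that point concrete, here is a minimal sketch with synthetic utilization samples (not data from the profiling run above): a link that looks nearly idle on average can still be pushed to line rate during collective bursts, so the fabric must be sized for the bursts rather than the mean.

```python
# Illustrative only: synthetic link-utilization samples, not data from the
# profiling run referenced above.
import numpy as np

rng = np.random.default_rng(0)

# Model a training step as mostly compute (near-idle network) punctuated by
# short all-reduce bursts that drive the link close to line rate.
samples = np.where(rng.random(10_000) < 0.05,        # ~5% of time inside a burst
                   rng.uniform(0.85, 1.0, 10_000),   # burst: 85-100% of line rate
                   rng.uniform(0.0, 0.05, 10_000))   # otherwise near idle

print(f"average utilization: {samples.mean():.1%}")              # ~7%, looks harmless
print(f"p99 utilization:     {np.percentile(samples, 99):.1%}")  # near line rate
```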
AI Clouds Going Through A Major Change
AI Workloads Require an AI Fabric
Control / User Access Network (N-S) vs. AI Fabric (E-W):
• Loosely-coupled applications vs. tightly-coupled processes
• TCP (low-bandwidth flows and utilization) vs. RDMA (high-bandwidth flows and utilization)
• High jitter tolerance vs. low jitter tolerance
• Oversubscribed topologies vs. nonblocking topologies
• Heterogeneous traffic with statistical multi-pathing vs. bursty network capacity with predictable performance
• Hardware- and software-accelerated in-network computing (AI fabric)
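As a small illustration of the topology contrast above, the sketch below computes a leaf switch's oversubscription ratio: a 1:1 (nonblocking) leaf is what the E-W AI fabric calls for, while N-S networks commonly tolerate oversubscription. The port counts and speeds are hypothetical.

```python
def oversubscription_ratio(downlinks: int, uplinks: int,
                           downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing to fabric-facing bandwidth on a leaf switch.
    1.0 means a nonblocking (1:1) leaf; >1.0 means oversubscribed."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical 64-port leaf: 32 ports down to hosts, 32 up to spines -> nonblocking,
# the arrangement the E-W AI fabric column calls for.
print(oversubscription_ratio(32, 32, 400, 400))   # 1.0
# Same switch cabled 48 down / 16 up -> 3:1 oversubscription, common for N-S traffic.
print(oversubscription_ratio(48, 16, 400, 400))   # 3.0
```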
AI Network Considerations
Software Acceleration (In-Network Computing)
• NCCL (NVIDIA Collective Communication Library): the SDK for AI communications that connects the GPUs and the network for AI network operations.

Hardware Acceleration
• SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): part of the InfiniBand and NVLink switch ASICs. It enables the network to perform data reduction operations, an important element of AI workloads. This decreases the amount of data traversing the network and dramatically reduces collective operations time.
• SHARP aggregation node: switch resident. Host: data source and destination.

[Chart: NCCL performance with vs. without SHARP across message sizes of 8 MiB to 16384 MiB: InfiniBand with SHARP delivers 1.7x higher performance than the best theoretical Ethernet.]
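For reference, the collectives that SHARP offloads are usually issued through a framework rather than raw NCCL calls. Below is a minimal PyTorch sketch of an all-reduce over the NCCL backend; the NCCL_COLLNET_ENABLE setting in the comment is the usual way to request SHARP/CollNet offload on an InfiniBand fabric, but verify it against the NCCL documentation for your version.

```python
# Minimal all-reduce sketch using PyTorch's NCCL backend.
# Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py
# To request SHARP/CollNet offload (verify against your NCCL version):
#   NCCL_COLLNET_ENABLE=1 torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL drives the GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a 256 MiB tensor; all-reduce sums it across ranks.
    x = torch.ones(64 * 1024 * 1024, device="cuda")   # 64M float32 = 256 MiB
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}  x[0]={x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```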
Applied Systems at NVIDIA
What do we do?
Building the Next Generation System at Scale
In practice, what do we look at?
• Cluster Architecture – Fully plan the production scale-out deployment (e.g., layout, design, fabric plan, management, storage and compute, data center…)
• Infrastructure – Build fast, automated workflows – stateless provisioning, automated recovery
• Telemetry = Our eyes and ears in the DC at all times
• Build proper tooling to understand failure rates and causes in order to improve our products
• Understanding incidents and failure conditions to improve quality
• Getting to better performance and value (perf/W)
• FW/SW Integration – Implementing a large and diverse SW stack: OS, users, security
• FW and SW are key in debugging/servicing and need to be integrated in the production flow as much as possible to save time
and costs – One is only as fast as the slowest member of a fleet
• Diagnostics @ scale – Most diagnostics are remote: visual inspection cannot be the primary debug mode
• End to End – All components of NVIDIA technology meet at scale. SW pieces have to work well together (GPU,
networking…) but also with an established ecosystem (Linux, Storage)
Hard equation to solve in production: get alignment on features and timelines for a large ecosystem involving GPU,
libraries, networking, security, engineering teams and users, storage
Ref: NVIDIA GTC: The Next-Generation DGX Architecture for Generative AI [S62421]
Compute InfiniBand Architecture
Fully plan production scale-out deployment
[Topology diagram: five PODs of compute nodes (EOS[0001-0640]), each node's 8 rails cabled into rail-optimized leaf switches L1.1-L40.4, spine switches S1.1-S40.4 grouped per rail, core switch groups C1.001-C4.032, and two fabric management nodes (ufmc-eos01, ufmc-eos02).]

• Full fat tree for performance.
• "Rail-optimized" to minimize latency and avoid congestion. "Rail" refers to the network link associated with a particular GPU index.
• Groups of 32 nodes have each rail connected to a single switch ("leaf rail-optimized").
• Rail groups are made of 4 leaf switches and 4 spine switches. There are 8 rail groups per POD, one per rail.
• The core switches are installed in conjunction with the first 4 PODs only, 32 switches each. Empty ports are left on the core switches to support 3 additional PODs without recabling the existing ones.
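The rail-optimized leaf assignment described above can be sketched in a few lines; the grouping and naming below are purely illustrative, not the actual EOS cabling plan.

```python
def rail_optimized_leaf(node_index: int, gpu_index: int,
                        nodes_per_leaf_group: int = 32, rails: int = 8) -> str:
    """Return an illustrative leaf-switch label for a given (node, GPU) pair.

    In a rail-optimized layout, GPU k on every node of a 32-node group lands on
    the same leaf switch, so same-rail traffic never has to cross a spine.
    """
    if not 0 <= gpu_index < rails:
        raise ValueError(f"gpu_index must be in [0, {rails})")
    group = node_index // nodes_per_leaf_group   # which 32-node group
    rail = gpu_index + 1                         # rail number = GPU index + 1
    return f"leaf[group={group + 1}, rail={rail}]"

# GPU 3 of nodes 0 and 31 share a leaf; node 32 moves to the next group's leaf.
print(rail_optimized_leaf(0, 3))    # leaf[group=1, rail=4]
print(rail_optimized_leaf(31, 3))   # leaf[group=1, rail=4]
print(rail_optimized_leaf(32, 3))   # leaf[group=2, rail=4]
```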
Telemetry
Data Collected
• Management nodes
• Syslog entries
• Standard system metrics such as utilization of CPU, memory, disks, ethernet interfaces, etc.
• Compute nodes
• Syslog entries
• All out-of-band metrics provided by the BMC, including temperature sensors, fan speeds, and CPU/GPU/system power (See
https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html for details).
• For user jobs that opt in to additional data collection, standard in-band metrics such as utilization of CPU, memory, disks, ethernet interfaces, etc. are also collected.
• Shared resources
• Basic network connectivity (ICMP and TCP pings) to systems on the public internet and inside the corporate network
• PDUs (power distribution units)
• CDUs (coolant distribution units)
• InfiniBand fabric per-port traffic utilization and error counters (through UFM)
• Ethernet fabric
• NFS and Lustre filesystems
• Slurm cluster (events; usage of nodes, partitions, accounts)
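The out-of-band metrics listed above come from each node's BMC over Redfish (see the DGX documentation linked above). A minimal polling sketch follows; the BMC hostname, credentials, and chassis ID are placeholders, and the exact resource paths vary by platform, so check the Redfish schema your BMC actually exposes.

```python
# Minimal out-of-band polling sketch against a BMC's Redfish API.
# Hostname, credentials, and the chassis ID ("1") are placeholders; resource
# paths differ by platform -- consult the vendor's Redfish documentation.
import requests

BMC = "https://bmc.example.internal"
AUTH = ("monitor", "REDACTED")   # read-only telemetry account

def get(path: str) -> dict:
    # verify=False only for the sketch; use the BMC's CA certificate in production.
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

thermal = get("/redfish/v1/Chassis/1/Thermal")
for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"))

power = get("/redfish/v1/Chassis/1/Power")
for ctl in power.get("PowerControl", []):
    print(ctl.get("Name"), ctl.get("PowerConsumedWatts"))
```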
Telemetry
Tools
• Types:
• In-band - telemetry collection by software running in the OS,
which may result in application overhead
• Out-of-band - telemetry collection through systems outside of the OS, such as a Baseboard Management Controller (BMC), which does not affect application performance
• Tools
• Prometheus - an open-source time-series database with a relatively strict data model for efficient storage of structured telemetry data
• exporters - multiple fit-for-purpose services that expose metrics, which Prometheus pulls (scrapes) into its database
• Grafana - an open-source, web-based graphing frontend supporting multiple backends, including Prometheus
• Splunk - a closed-source tool that includes both a database and a web-based graphing frontend, optimized for storing, searching, and visualizing event data
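As a concrete example of the exporter pattern, the sketch below exposes one gauge over HTTP for Prometheus to scrape, using the open-source prometheus_client package. The metric name and the nvidia-smi query are illustrative stand-ins for the purpose-built exporters a production cluster would run (node_exporter, the DCGM exporter, and so on).

```python
# Minimal exporter sketch: exposes one gauge that Prometheus can scrape.
# Metric name and collection method are illustrative only.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

GPU_POWER = Gauge("demo_gpu_power_watts", "GPU power draw", ["gpu"])

def collect() -> None:
    # One CSV line per GPU, e.g. "0, 71.23"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, watts = (field.strip() for field in line.split(","))
        GPU_POWER.labels(gpu=idx).set(float(watts))

if __name__ == "__main__":
    start_http_server(9101)   # Prometheus scrapes http://host:9101/metrics
    while True:
        collect()
        time.sleep(15)
```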
Telemetry
Tools in NVIDIA
• NetQ Flow Analysis – analyze active network traffic flows
• UFM (Unified Fabric Manager) – anomaly prediction
A Turnkey AI Data Center
A logical depiction of a SuperPOD
• In-Network Computing
• Computational Storage
• Performance Isolation
• Enhanced Telemetry
• Zero Trust Security