
Best Practices in Networking for AI

Sungta Tsai, Sr. Solution Architect | April 2024


Explosive Growth in AI Computational Requirements

Before Transformers = 8x / 2yrs
Transformers = 256x / 2yrs

[Chart: training compute (petaFLOPs, log scale) of landmark models from 2012 to 2024, from AlexNet through VGG-19, ResNet, InceptionV3, Xception, DenseNet201, Seq2Seq, Transformer, ELMo, BERT Large, GPT-1, ResNeXt, XLNet, MoCo, ResNet50, Wav2Vec 2.0, GPT-2 1.5B, Microsoft T-NLG, GPT-3 175B, Megatron-NLG, MT-NLG 530B, Chinchilla, BLOOM, and PaLM, up to GPT-MoE-1.8T]
The Data Center is The Computer
The network defines the data center

[Chart: recommended fabric versus cluster size, from 10 to 1M+ GPUs, ranging from traditional Ethernet through Spectrum-X Ethernet and InfiniBand up to NVLink + InfiniBand]

AI Cloud
• Cloud, multi-tenant
• Variety of small-scale workloads
• Traditional Ethernet network can suffice

Generative AI Cloud
• Multi-tenant
• Variety of workloads including larger-scale generative AI
• Traditional Ethernet network for North-South traffic
• Spectrum-X Ethernet for the AI fabric (East-West)

AI Factories
• Single or few users
• Extremely large AI models
• NVLink and InfiniBand are the gold standard for the AI fabric
LLM Compute and Communication Profiling

Representative profile from a large-scale LLM training run.

Communication is bursty in nature; average bandwidth utilization is not a good criterion for sizing the network.
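
As a toy illustration of why averages mislead here, the following minimal Python sketch uses made-up numbers (not taken from the profile above): a link that runs at line rate only during collective phases looks nearly idle on average, yet still needs full line-rate capacity during the bursts.

# Illustrative only: synthetic on/off traffic, not real profiling data.
# Assumed 400 Gb/s per-port line rate; collectives occupy ~20% of each step.
line_rate_gbps = 400.0
samples = []
for step in range(1000):
    in_burst = (step % 10) < 2          # 2 of every 10 samples fall inside a collective
    samples.append(line_rate_gbps if in_burst else 0.0)

avg = sum(samples) / len(samples)
peak = max(samples)
print(f"average: {avg:.0f} Gb/s ({avg / line_rate_gbps:.0%} of line rate)")
print(f"peak:    {peak:.0f} Gb/s ({peak / line_rate_gbps:.0%} of line rate)")
# Sizing the fabric to the ~20% average would stall every collective;
# the network has to absorb the 100% bursts.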
AI Clouds Going Through A Major Change
AI Workloads Require an AI Fabric

Control / User Access Network (N-S) vs. AI Fabric (E-W)

• Loosely-coupled applications vs. tightly-coupled processes
• TCP (low-bandwidth flows and utilization) vs. RDMA (high-bandwidth flows and utilization)
• High jitter tolerance vs. low jitter tolerance
• Oversubscribed topologies vs. nonblocking topologies
• Heterogeneous traffic with statistical multi-pathing vs. bursty network capacity with predictive performance
• The AI fabric additionally relies on hardware- and software-accelerated in-network computing
AI Network Considerations

Software acceleration (in-network computing)
NCCL — NVIDIA Collective Communication Library. The SDK library for AI communications; it connects the GPUs and the network for AI network operations.

Hardware acceleration
SHARP — Scalable Hierarchical Aggregation and Reduction Protocol. SHARP is part of the InfiniBand and NVLink switch ASICs. It enables the network to perform data-reduction operations, an important element of AI workloads. This decreases the amount of data traversing the network and dramatically reduces collective-operation time. The SHARP aggregation node is switch-resident; the host is the data source and destination.

[Chart: NCCL performance with vs. without SHARP across message sizes from 8 MiB to 16,384 MiB; InfiniBand with SHARP delivers up to 1.7x higher performance than the best theoretical Ethernet performance.]
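
To make the NCCL side concrete, here is a minimal PyTorch sketch of the kind of collective (an all-reduce) that SHARP offloads to the switches. The use of torch.distributed and the NCCL_COLLNET_ENABLE environment variable are assumptions about a typical deployment, not details taken from the slide.

# Minimal sketch of an NCCL all-reduce via torch.distributed.
# Launch (assumption): torchrun --nproc_per_node=8 allreduce_sketch.py
# with NCCL_COLLNET_ENABLE=1 set on a SHARP-capable InfiniBand fabric.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")               # NCCL drives the GPU + network path
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A large message, comparable to the MiB-scale sizes on the chart above.
    tensor = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MiB of float32

    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)          # the reduction a SHARP switch can offload
    torch.cuda.synchronize()
    if rank == 0:
        print("all-reduce complete, element value:", tensor[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()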


Applied Systems at NVIDIA
What do we do?
Building the Next Generation System at Scale
In practice, what do we look at?

• Cluster Architecture – Fully plan the production scale-out deployment (e.g. layout, design, fabric plan, management, storage and compute, datacenter…)
• Infrastructure – Build fast, automated workflows: stateless provisioning, automated recovery
• Telemetry – Our eyes and ears in the DC at all times
  • Build proper tooling to understand failure rates and causes in order to improve our products
  • Understand incidents and failure conditions to improve quality
  • Get to better performance and value (perf/W)
• FW/SW Integration – Implement a large and diverse SW stack: OS, users, security
  • FW and SW are key in debugging/servicing and need to be integrated into the production flow as much as possible to save time and cost – one is only as fast as the slowest member of a fleet
• Diagnostics @ scale – Most diagnostics are remote: visual inspection cannot be the primary debug mode
• End to End – All components of NVIDIA technology meet at scale. SW pieces have to work well together (GPU, networking…) but also with an established ecosystem (Linux, storage)

Hard equation to solve in production: getting alignment on features and timelines for a large ecosystem involving GPUs, libraries, networking, security, engineering teams, users, and storage

Ref: NVIDIA GTC: The Next-Generation DGX Architecture for Generative AI [S62421]
Compute InfiniBand Architecture
Fully plan production scale-out deployment

• Full Fat Tree for performance
• "Rail-optimized" to minimize latency and avoid congestion. "Rail" refers to the network link associated with a particular GPU index
• Groups of 32 nodes have each rail connected to a single leaf switch ("Leaf Rail-Optimized")
• Rail Groups are made of 4 leaf switches and 4 spine switches. There are 8 Rail Groups per POD, one per rail
• The core switches are installed in conjunction with the first 4 PODs only, 32 switches each. Empty ports are left on the core switches to support 3 additional PODs without recabling the existing ones

[Diagram: five PODs of hosts EOS[0001-0640], leaf switches L1.1-L40.4 and spine switches S1.1-S40.4 organized per rail (Rails 1-8), core switch groups C1.001-C4.032, managed by UFM nodes ufmc-eos01 and ufmc-eos02]
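
The rail-optimized rule above can be written down as a small mapping from (node, rail) to a leaf switch. The sketch below is only an illustration of that rule using the group sizes stated on the slide; the L<group>.<leaf> naming mirrors the diagram, but the function itself is hypothetical, not NVIDIA's cabling tool.

# Sketch of the rail-optimized leaf assignment described above.
# Assumptions (from the slide where stated, otherwise illustrative):
#   - 128 nodes per POD, in 4 groups of 32 nodes
#   - 8 rails per node (one per GPU/NIC index)
#   - one Rail Group per rail per POD, each with 4 leaf switches (L<group>.<1-4>)

NODES_PER_GROUP = 32
GROUPS_PER_POD = 4
RAILS_PER_NODE = 8
NODES_PER_POD = NODES_PER_GROUP * GROUPS_PER_POD   # 128

def leaf_for(node: int, rail: int) -> str:
    """Return the leaf switch name for a given node (1-based) and rail (1-8)."""
    pod = (node - 1) // NODES_PER_POD + 1
    group_in_pod = ((node - 1) % NODES_PER_POD) // NODES_PER_GROUP + 1
    rail_group = (pod - 1) * RAILS_PER_NODE + rail   # e.g. POD 1, rail 1 -> Rail Group 1
    return f"L{rail_group}.{group_in_pod}"

# All 32 nodes of a group land on the same leaf for a given rail, so traffic on
# that rail within the group never has to leave its leaf switch.
print(leaf_for(node=1, rail=3))     # -> L3.1
print(leaf_for(node=32, rail=3))    # -> L3.1
print(leaf_for(node=33, rail=3))    # -> L3.2  (next group of 32)
print(leaf_for(node=129, rail=1))   # -> L9.1  (first node of POD 2)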
Telemetry
Data Collected

• Management nodes
• Syslog entries
• Standard system metrics such as utilization of CPU, memory, disks, ethernet interfaces, etc.
• Compute nodes
• Syslog entries
• All out-of-band metrics provided by the BMC, including temperature sensors, fan speeds, and CPU/GPU/system power (See
https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html for details).
• For user jobs that opt in to additional data collection, standard in-band metrics such as utilization of CPU, memory, disks, Ethernet interfaces, etc. are also collected.
• Shared resources
• Basic network connectivity (ICMP and TCP pings) to systems on the public internet and inside the corporate network
• PDUs (power distribution units)
• CDUs (coolant distribution units)
• InfiniBand fabric per-port traffic utilization and error counters (through UFM)
• Ethernet fabric
• NFS and Lustre filesystems
• Slurm cluster (events; usage of nodes, partitions, accounts)
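
As an illustration of how the out-of-band BMC metrics above can be pulled, the sketch below polls a Redfish Thermal resource. The BMC address, credentials, and chassis ID are placeholders, and the exact resource paths should be checked against the DGX Redfish documentation linked above.

# Illustrative out-of-band poll of BMC temperature sensors over Redfish.
# BMC address, credentials, and chassis ID below are placeholders.
import requests

BMC = "https://bmc.example.internal"     # hypothetical BMC address
AUTH = ("monitor", "REDACTED")           # hypothetical read-only account

def read_temperatures(chassis_id="Self"):
    """Return {sensor name: reading in Celsius} from the Redfish Thermal resource."""
    url = f"{BMC}/redfish/v1/Chassis/{chassis_id}/Thermal"
    resp = requests.get(url, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    return {t["Name"]: t.get("ReadingCelsius")
            for t in resp.json().get("Temperatures", [])}

if __name__ == "__main__":
    for name, celsius in read_temperatures().items():
        print(f"{name}: {celsius} C")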
Telemetry
Tools

• Types:
• In-band - telemetry collection by software running in the OS,
which may result in application overhead
• Out-of-band - telemetry collection through systems outside of
OS, such as a Baseboard Management Controller (BMC), which
does not affect application performance
• Tools
• Prometheus - an open-source time-series database with a relatively strict data model for efficient storage of structured telemetry data
• exporters - multiple fit-for-purpose services that provide metrics which are pulled into the Prometheus database
• Grafana - an open-source web-based graphing frontend supporting multiple backends, including Prometheus
• Splunk - a closed-source tool which includes both a database and a web-based graphing frontend, optimized for storing, searching, and visualizing event data
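
To make the "exporter" idea concrete, here is a minimal in-band exporter sketch built on the prometheus_client library, exposing per-GPU power draw read via pynvml. The metric name and port are illustrative choices, not a description of NVIDIA's internal exporters.

# Minimal in-band exporter sketch: exposes per-GPU power draw for Prometheus to scrape.
# Metric name and port are illustrative; requires the prometheus_client and pynvml packages.
import time
import pynvml
from prometheus_client import Gauge, start_http_server

GPU_POWER_WATTS = Gauge("node_gpu_power_watts", "GPU power draw in watts", ["gpu"])

def collect():
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # nvmlDeviceGetPowerUsage returns milliwatts; convert to watts.
        GPU_POWER_WATTS.labels(gpu=str(i)).set(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9402)      # Prometheus pulls http://<node>:9402/metrics
    while True:
        collect()
        time.sleep(15)           # roughly one scrape interval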
Telemetry
Tools in NVIDIA

• NetQ Flow Analysis – analyze active network traffic flows
• UFM (Unified Fabric Manager) – anomaly prediction
A Turnkey AI Data Center
A logical depiction of SuperPOD:
• In-Network Computing
• Computational Storage
• Performance Isolation
• Enhanced Telemetry
• Zero Trust Security
