Best Practices in Networking for AI
Sungta Tsai, Sr. Solution Architect | April 2024
Explosive Growth in AI Computational Requirements
[Chart: training compute (petaFLOPs, log scale) for landmark models from 2012 to 2024, spanning AlexNet through GPT-MoE-1.8T. Before transformers, training compute grew roughly 8x every 2 years; since transformers, roughly 256x every 2 years.]
The Data Center is The Computer
The network defines the data center
[Diagram: deployment types positioned by cluster size (10 to 1M+ GPUs) and the fabric serving each: Traditional Ethernet, NVIDIA Spectrum-X AI Ethernet Fabric, NVLink + InfiniBand.]

Cloud (Traditional Ethernet)
• Multi-tenant
• Variety of small-scale workloads
• Traditional ethernet network can suffice

AI Cloud / Generative AI Cloud (NVIDIA Spectrum-X AI Ethernet Fabric)
• Multi-tenant
• Variety of workloads, including larger-scale generative AI
• Traditional ethernet network for North-South traffic
• Spectrum-X ethernet for the AI fabric (East-West)

AI Factory (NVLink + InfiniBand)
• Single or few users
• Extremely large AI models
• NVLink and InfiniBand are the gold standard for the AI fabric
LLM Compute and Communication Profiling
Representative profile from a large-scale LLM training run.
Communication is bursty by nature; average bandwidth utilization is not a good criterion for sizing the network.
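To make that point concrete, here is a minimal sketch with synthetic utilization samples (not data from the profiling run above): a link that looks nearly idle on average can still be pushed to line rate during collective bursts, so the fabric must be sized for the bursts rather than the mean.

```python
# Illustrative only: synthetic link-utilization samples, not data from the
# profiling run referenced above.
import numpy as np

rng = np.random.default_rng(0)

# Model a training step as mostly compute (near-idle network) punctuated by
# short all-reduce bursts that drive the link close to line rate.
samples = np.where(rng.random(10_000) < 0.05,        # ~5% of time inside a burst
                   rng.uniform(0.85, 1.0, 10_000),   # burst: 85-100% of line rate
                   rng.uniform(0.0, 0.05, 10_000))   # otherwise near idle

print(f"average utilization: {samples.mean():.1%}")              # ~7%, looks harmless
print(f"p99 utilization:     {np.percentile(samples, 99):.1%}")  # near line rate
```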
AI Clouds Going Through A Major Change
AI Workloads Require an AI Fabric
Control / User Access Network (N-S) vs. AI Fabric (E-W):
• Loosely-coupled applications vs. tightly-coupled processes
• TCP (low-bandwidth flows and utilization) vs. RDMA (high-bandwidth flows and utilization)
• High jitter tolerance vs. low jitter tolerance
• Oversubscribed topologies vs. nonblocking topologies
• Heterogeneous traffic with statistical multi-pathing vs. bursty network capacity with predictable performance
• Hardware- and software-accelerated in-network computing (AI fabric)
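As a small illustration of the topology contrast above, the sketch below computes a leaf switch's oversubscription ratio: a 1:1 (nonblocking) leaf is what the E-W AI fabric calls for, while N-S networks commonly tolerate oversubscription. The port counts and speeds are hypothetical.

```python
def oversubscription_ratio(downlinks: int, uplinks: int,
                           downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing to fabric-facing bandwidth on a leaf switch.
    1.0 means a nonblocking (1:1) leaf; >1.0 means oversubscribed."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical 64-port leaf: 32 ports down to hosts, 32 up to spines -> nonblocking,
# the arrangement the E-W AI fabric column calls for.
print(oversubscription_ratio(32, 32, 400, 400))   # 1.0
# Same switch cabled 48 down / 16 up -> 3:1 oversubscription, common for N-S traffic.
print(oversubscription_ratio(48, 16, 400, 400))   # 3.0
```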
AI Network Considerations
Software Acceleration (In-Network Computing)
• NCCL (NVIDIA Collective Communication Library): the SDK for AI communications that connects the GPUs and the network for AI network operations.

Hardware Acceleration
• SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): part of the InfiniBand and NVLink switch ASICs. It enables the network to perform data reduction operations, an important element of AI workloads. This decreases the amount of data traversing the network and dramatically reduces collective operations time.
• SHARP aggregation node: switch resident. Host: data source and destination.

[Chart: NCCL performance with vs. without SHARP across message sizes of 8 MiB to 16384 MiB: InfiniBand with SHARP delivers 1.7x higher performance than the best theoretical Ethernet.]
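For reference, the collectives that SHARP offloads are usually issued through a framework rather than raw NCCL calls. Below is a minimal PyTorch sketch of an all-reduce over the NCCL backend; the NCCL_COLLNET_ENABLE setting in the comment is the usual way to request SHARP/CollNet offload on an InfiniBand fabric, but verify it against the NCCL documentation for your version.

```python
# Minimal all-reduce sketch using PyTorch's NCCL backend.
# Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py
# To request SHARP/CollNet offload (verify against your NCCL version):
#   NCCL_COLLNET_ENABLE=1 torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL drives the GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a 256 MiB tensor; all-reduce sums it across ranks.
    x = torch.ones(64 * 1024 * 1024, device="cuda")   # 64M float32 = 256 MiB
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}  x[0]={x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```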
Applied Systems at NVIDIA
What do we do?
Building the Next Generation System at Scale
In practice, what do we look at?
• Cluster Architecture – Fully plan the production scale-out deployment (e.g., layout, design, fabric plan, management, storage and compute, data center…)
• Infrastructure – Build fast, automated workflows – stateless provisioning, automated recovery
• Telemetry = Our eyes and ears in the DC at all times
• Build proper tooling to understand failure rates and causes in order to improve our products
• Understanding incidents and failure conditions to improve quality
• Getting to better performance and value (perf/W)
• FW/SW Integration – Implementing a large and diverse SW stack: OS, users, security
• FW and SW are key in debugging/servicing and need to be integrated in the production flow as much as possible to save time
and costs – One is only as fast as the slowest member of a fleet
• Diagnostics @ scale – Most diagnostics are remote: visual inspection cannot be the primary debug mode
• End to End – All components of NVIDIA technology meet at scale. SW pieces have to work well together (GPU,
networking…) but also with an established ecosystem (Linux, Storage)
Hard equation to solve in production: get alignment on features and timelines for a large ecosystem involving GPU,
libraries, networking, security, engineering teams and users, storage
Ref: NVIDIA GTC: The Next-Generation DGX Architecture for Generative AI [S62421]
Compute InfiniBand Architecture
Fully plan production scale-out deployment
[Topology diagram: five PODs of compute nodes (EOS[0001-0640]), each node's 8 rails cabled into rail-optimized leaf switches L1.1-L40.4, spine switches S1.1-S40.4 grouped per rail, core switch groups C1.001-C4.032, and two fabric management nodes (ufmc-eos01, ufmc-eos02).]

• Full fat tree for performance.
• "Rail-optimized" to minimize latency and avoid congestion. "Rail" refers to the network link associated with a particular GPU index.
• Groups of 32 nodes have each rail connected to a single switch ("leaf rail-optimized").
• Rail groups are made of 4 leaf switches and 4 spine switches. There are 8 rail groups per POD, one per rail.
• The core switches are installed in conjunction with the first 4 PODs only, 32 switches each. Empty ports are left on the core switches to support 3 additional PODs without recabling the existing ones.
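The rail-optimized leaf assignment described above can be sketched in a few lines; the grouping and naming below are purely illustrative, not the actual EOS cabling plan.

```python
def rail_optimized_leaf(node_index: int, gpu_index: int,
                        nodes_per_leaf_group: int = 32, rails: int = 8) -> str:
    """Return an illustrative leaf-switch label for a given (node, GPU) pair.

    In a rail-optimized layout, GPU k on every node of a 32-node group lands on
    the same leaf switch, so same-rail traffic never has to cross a spine.
    """
    if not 0 <= gpu_index < rails:
        raise ValueError(f"gpu_index must be in [0, {rails})")
    group = node_index // nodes_per_leaf_group   # which 32-node group
    rail = gpu_index + 1                         # rail number = GPU index + 1
    return f"leaf[group={group + 1}, rail={rail}]"

# GPU 3 of nodes 0 and 31 share a leaf; node 32 moves to the next group's leaf.
print(rail_optimized_leaf(0, 3))    # leaf[group=1, rail=4]
print(rail_optimized_leaf(31, 3))   # leaf[group=1, rail=4]
print(rail_optimized_leaf(32, 3))   # leaf[group=2, rail=4]
```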
Telemetry
Data Collected
• Management nodes
• Syslog entries
• Standard system metrics such as utilization of CPU, memory, disks, ethernet interfaces, etc.
• Compute nodes
• Syslog entries
• All out-of-band metrics provided by the BMC, including temperature sensors, fan speeds, and CPU/GPU/system power (See
https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html for details).
• For user jobs that opt in to additional data collection, standard in-band metrics such as utilization of CPU, memory, disks, ethernet interfaces, etc. are also collected.
• Shared resources
• Basic network connectivity (ICMP and TCP pings) to systems on the public internet and inside the corporate network
• PDUs (power distribution units)
• CDUs (coolant distribution units)
• InfiniBand fabric per-port traffic utilization and error counters (through UFM)
• Ethernet fabric
• NFS and Lustre filesystems
• Slurm cluster (events; usage of nodes, partitions, accounts)
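The out-of-band metrics listed above come from each node's BMC over Redfish (see the DGX documentation linked above). A minimal polling sketch follows; the BMC hostname, credentials, and chassis ID are placeholders, and the exact resource paths vary by platform, so check the Redfish schema your BMC actually exposes.

```python
# Minimal out-of-band polling sketch against a BMC's Redfish API.
# Hostname, credentials, and the chassis ID ("1") are placeholders; resource
# paths differ by platform -- consult the vendor's Redfish documentation.
import requests

BMC = "https://bmc.example.internal"
AUTH = ("monitor", "REDACTED")   # read-only telemetry account

def get(path: str) -> dict:
    # verify=False only for the sketch; use the BMC's CA certificate in production.
    r = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

thermal = get("/redfish/v1/Chassis/1/Thermal")
for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"))

power = get("/redfish/v1/Chassis/1/Power")
for ctl in power.get("PowerControl", []):
    print(ctl.get("Name"), ctl.get("PowerConsumedWatts"))
```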
Telemetry
Tools
• Types:
• In-band - telemetry collection by software running in the OS,
which may result in application overhead
• Out-of-band - telemetry collection through systems outside of the OS, such as a Baseboard Management Controller (BMC), which does not affect application performance
• Tools
• Prometheus - an open-source time-series database with a relatively strict data model for efficient storage of structured telemetry data
• exporters - multiple fit-for-purpose services that expose metrics, which Prometheus pulls (scrapes) into its database
• Grafana - an open-source, web-based graphing frontend supporting multiple backends, including Prometheus
• Splunk - a closed-source tool that includes both a database and a web-based graphing frontend, optimized for storing, searching, and visualizing event data
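As a concrete example of the exporter pattern, the sketch below exposes one gauge over HTTP for Prometheus to scrape, using the open-source prometheus_client package. The metric name and the nvidia-smi query are illustrative stand-ins for the purpose-built exporters a production cluster would run (node_exporter, the DCGM exporter, and so on).

```python
# Minimal exporter sketch: exposes one gauge that Prometheus can scrape.
# Metric name and collection method are illustrative only.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

GPU_POWER = Gauge("demo_gpu_power_watts", "GPU power draw", ["gpu"])

def collect() -> None:
    # One CSV line per GPU, e.g. "0, 71.23"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, watts = (field.strip() for field in line.split(","))
        GPU_POWER.labels(gpu=idx).set(float(watts))

if __name__ == "__main__":
    start_http_server(9101)   # Prometheus scrapes http://host:9101/metrics
    while True:
        collect()
        time.sleep(15)
```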
Telemetry
Tools in NVIDIA
• NetQ Flow Analysis – analyze active network traffic flows
• UFM (Unified Fabric Manager) – anomaly prediction
A Turnkey AI Data Center
A logical depiction of a SuperPOD
• In-Network Computing
• Computational Storage
• Performance Isolation
• Enhanced Telemetry
• Zero Trust Security