NVIDIA Spectrum-X Networking
Platform
February 2024
The Data Center is The Computer
The network defines the data center
[Chart: fabric choice vs. cluster size, from 10 to 1M+ GPUs in the cluster]
Cloud (Traditional Ethernet)
• Multi-tenant
• Variety of small-scale workloads
• Traditional Ethernet network can suffice
AI Cloud (NVIDIA Spectrum-X AI Ethernet Fabric)
• Generative AI cloud
• Multi-tenant
• Variety of workloads, including larger-scale generative AI
• Traditional Ethernet network for North-South traffic
• NVIDIA Spectrum-X Ethernet for the AI fabric (East-West)
AI Factory (NVLink + InfiniBand)
• Single or few users
• Extremely large AI models
• NVIDIA NVLink and InfiniBand: the gold standard for the AI fabric
NVIDIA Quantum-2 & Spectrum-4 Platform
QUANTUM-2 SWITCH (INFINIBAND)
• 64 ports of 400 Gbps or 128 ports of 200 Gbps
• SHARPv3 small message data reductions
• SHARPv3 large message data reductions
• 32X more AI acceleration engines
CONNECTX-7 (INFINIBAND/ETHERNET)
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI Tag Matching and All-to-All
BLUEFIELD-3 (INFINIBAND/ETHERNET)
• 16 Arm A78 64-bit cores
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI and NCCL accelerations
• Computational storage
• Security engines
SPECTRUM-4 (ETHERNET)
• 64 ports of 800 Gbps or 128 ports of 400 Gbps
• 100 billion transistors, TSMC 4N
• 51.2T bandwidth, 100G SerDes
• 8K GPUs in a 2-level fabric, 500K in a 3-level fabric
What is Spectrum-X
• Maximizing effective bandwidth
  • Optimized for AI workloads
  • Optimized and consistent bandwidth with any workload placement
  • Optimized and consistent bandwidth upon any failure
• BlueField-3 SuperNIC
  • Programmable congestion control
  • Packet reordering at 400 Gb/s
  • Power-efficient, lean design
  • Simple, secured bare metal
• Spectrum-4
  • Fully featured 51.2 Tb/s switch
  • Single shared buffer
  • Native RoCE support (Cumulus Linux/SONiC)
  • AI-focused telemetry (NetQ)
[Diagram: Spectrum-X stack – Spectrum-4 switch and BlueField-3 SuperNICs, DOCA, SAI/SPSDK, Magnum IO, and SONiC / Cumulus / NetQ / Air services]
Why Spectrum-X
General Purpose Cloud (North-South) – Traditional Ethernet, BlueField-3 DPU (B3220)
• Loosely coupled applications
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Many small flows
AI Fabric (East-West) – Spectrum-X, BlueField-3 SuperNIC (B3140H)
• Distributed, tightly coupled processing
• RoCE (high-bandwidth flows and utilization)
• Low jitter tolerance (long tail kills performance)
• Few large flows
NVIDIA Spectrum-4 Ethernet AI Switch
AVAILABILITY
• 100G SerDes in production
• 51.2T, high radix
• Designed for AI at scale
• Proven with GPUs
• 5th-gen ASIC
EFFICIENCY
• Power efficient, single ASIC, no HBM
• No overheads, natively non-blocking
• Flexible and scalable leaf-spine
• Scale-out with fewer optics
• Future proof: supports >4K 800G GPUs
• Direct Drive savings – optics and copper
PERFORMANCE
• Adaptive Routing technology
• Fine-grain load balancing
• RoCE optimized for cloud
• Fully shared buffer – fairness
• Programmable congestion control
• Congestion isolation
• Low latency and low tail latency
STANDARD
• Standard Ethernet/IP packets and control plane
• Fully disaggregated
• No need for a controller
• Open SDK
• SONiC, Cumulus, Linux Switch
Full Stack Optimization
Israel-1 Spectrum-X Generative AI Cloud
• NVIDIA playground to create test reports on real-life AI applications and benchmarks, and to create Reference Architectures
• Generative AI cloud supercomputer based on the new Israel-developed NVIDIA Spectrum-X platform
• 256 NVIDIA HGX H100 systems, including 2,048 H100 GPUs, in servers provided by Dell Technologies
• 2,560 BlueField-3 DPUs and 80+ Spectrum-4 Ethernet switches, all developed in Israel
• Among the world's most powerful supercomputers, and the most powerful supercomputer in Israel
• Peak AI performance of 8 exaflops
https://www.reuters.com/technology/nvidia-build-israeli-supercomputer-ai-demand-soars-2023-05-29/
ISRAEL-1 Digital Twin
Simulation on NVIDIA Air
• Allows:
  • Full datapath connectivity
  • OOB configuration and ZTP can be tested
  • Build automations to configure Cumulus Linux
  • Build automations to configure the hosts' Linux
  • Validate the connectivity configuration by sending traffic from the hosts
  • Topology / cabling can be modified
• Scale:
  • The simulation can be duplicated for parallel testing
  • Multiple users can access the same simulation
• Next phases:
  • Increase the level of simulation by running SimX-{Switch, DPU}
  • An additional storage network can be added
Agenda
• Problem definition
• What is Spectrum-X?
• Scalability
• Congestion control
• NCCL
• Topologies
• Adaptive routing
• Multitenancy
• Spectrum-X comparison
Problem Definition
• Traditional data center fabric characteristics
• Mixed flow duration and size
• High entropy, ECMP is adequate
• Fabric utilization <50%
• AI Fabric characteristics
• AI Workloads are made of elephant flows
• Very low entropy
• Very high fabric utilization (90% and above)
• Verdict:
• Tools & features used to build traditional data centers do
not give good results for AI workloads
• NVIDIA developed an optimized congestion control mechanism and a method to distribute traffic across ports efficiently (Adaptive Routing); the sketch below illustrates the underlying ECMP problem
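To make the entropy point concrete, here is a minimal, purely illustrative Python sketch (not NVIDIA tooling; the flow tuples, uplink count, and use of Python's built-in hash are assumptions) contrasting how many small flows and a few elephant flows land on uplinks under static ECMP-style hashing:

```python
# Illustrative only: why static per-flow ECMP hashing struggles with elephant flows.
# Flow tuples, the uplink count, and Python's hash() as the ECMP hash are assumptions.
from collections import Counter

NUM_UPLINKS = 8

def ecmp_uplink(flow_tuple):
    """Static ECMP: every packet of a flow lands on the same uplink."""
    return hash(flow_tuple) % NUM_UPLINKS

# Traditional traffic: many small flows -> high entropy, load spreads out evenly.
many_small = [("10.0.0.%d" % (i % 250), "10.0.1.%d" % (i % 250), 1000 + i)
              for i in range(10_000)]
small_load = Counter(ecmp_uplink(f) for f in many_small)

# AI traffic: a few elephant flows -> low entropy; collisions typically leave some
# uplinks overloaded while others sit idle.
elephants = [("10.1.0.1", "10.1.1.1", 4791 + i) for i in range(8)]
elephant_load = Counter(ecmp_uplink(f) for f in elephants)

print("many small flows per uplink:", dict(small_load))
print("elephant flows per uplink:  ", dict(elephant_load))
```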
What is Spectrum-X?
Scalability
• Spectrum-4 ASIC
• Tuned for RoCEv2 (focusing on AI/ML workloads), lossless
• Up to 8K endpoints with a 2-tier fabric, up to 512K with a 3-tier fabric (see the radix arithmetic sketched below)
[Diagrams: 2-tier fabric – up to 8K GPUs; 3-tier fabric – up to 512K GPUs]
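The 8K and 512K figures fall out of the Spectrum-4 radix; here is a back-of-the-envelope check, assuming 128 ports of 400GbE per switch and a non-blocking half-down/half-up split (a simplification for illustration, not a deployment rule):

```python
# Back-of-the-envelope endpoint counts for 2-tier and 3-tier Spectrum-4 fabrics.
# Assumes 128x 400GbE ports per switch and a non-blocking half-down/half-up split.
RADIX = 128          # 400GbE ports per Spectrum-4 switch
DOWN = RADIX // 2    # ports facing the tier below (hosts, for a leaf)

# 2-tier: each leaf serves DOWN hosts; the spine radix bounds the number of leaves.
endpoints_2_tier = RADIX * DOWN               # 128 leaves * 64 hosts = 8,192  (~8K)

# 3-tier: a pod is bounded by a spine's DOWN ports (64 leaves), and the
# super-spine radix bounds the number of pods (128).
endpoints_per_pod = DOWN * DOWN               # 64 leaves * 64 hosts = 4,096
endpoints_3_tier = RADIX * endpoints_per_pod  # 128 pods * 4,096 = 524,288 (~512K)

print(endpoints_2_tier, endpoints_3_tier)     # 8192 524288
```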
What is Spectrum-X?
Congestion Control
• The AI fabric must be utilized above 90% to ensure that all available GPU resources are used and no GPU sits idle
• DCQCN (Data Center Quantized Congestion
Notification) is an end-to-end congestion
control method for RoCEv2
• Relies on ECN (Explicit Congestion
Notification)
• Responds to congestion notifications and
dynamically adjusts traffic transmit rates
• The industry standard and de facto congestion control method for data centers running AI/ML workloads (a simplified sender-side sketch follows below)
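As a rough mental model only (not the DCQCN specification and not NVIDIA's implementation), the sender side can be pictured as a rate controller that backs off when ECN-driven congestion notification packets (CNPs) arrive and recovers otherwise; every constant below is an illustrative assumption:

```python
# Illustrative DCQCN-style sender behavior: multiplicative decrease when an
# ECN-driven congestion notification (CNP) arrives, gradual recovery otherwise.
# The constants and the congestion trace are made up for demonstration.
LINE_RATE_GBPS = 400.0

def adjust_rate(rate_gbps, cnp_received, alpha=0.5, recovery_step_gbps=10.0):
    """Return the new transmit rate after one control interval."""
    if cnp_received:  # a switch marked ECN, so the receiver sent a CNP
        return max(rate_gbps * (1.0 - alpha / 2.0), 1.0)
    return min(rate_gbps + recovery_step_gbps, LINE_RATE_GBPS)

rate = LINE_RATE_GBPS
for cnp in [False, True, True, False, False, False, True, False]:
    rate = adjust_rate(rate, cnp)
    print(f"cnp={cnp!s:<5} rate={rate:6.1f} Gb/s")
```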
What is Spectrum-X?
Congestion Control – NVIDIA Spectrum-X
• ZTR-CC (Zero Touch RoCE – Congestion Control) is an NVIDIA-developed technology that allows zero-touch RoCE deployment on switches and provides Round Trip Time (RTT) based congestion control
• Timing packets are periodically sent by the sender and returned immediately by the receiver; this allows the RTT to be measured and used to estimate congestion and throttle the transmitted traffic (see the sketch below)
• Spectrum-X uses ZTR-CC
• ZTR-CC: RTT + ECN
  • RTT measures the round-trip delay through the fabric
  • ECN marking increases the fidelity of CC signaling
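The RTT half of that signal can be pictured as a probe loop: the sender timestamps a timing packet, the receiver reflects it immediately, and RTT inflation (combined with ECN marks) drives throttling. The sketch below is a simplified illustration under assumed thresholds, not NVIDIA's ZTR-CC algorithm; the probe/echo hooks are placeholders:

```python
# Simplified illustration of RTT-probe congestion detection (not the ZTR-CC implementation).
# send_probe/await_echo are placeholder hooks; the baseline and threshold are assumptions.
import time

def measure_rtt(send_probe, await_echo):
    """Timestamp a timing packet, wait for the receiver's immediate echo, return RTT."""
    t_sent = time.monotonic()
    send_probe(t_sent)
    await_echo()
    return time.monotonic() - t_sent

def congested(rtt_s, baseline_s, ecn_marked, inflation_threshold=1.5):
    """Combine RTT inflation with ECN marks, mirroring the 'RTT + ECN' idea."""
    return (rtt_s > inflation_threshold * baseline_s) or ecn_marked

# Dummy hooks standing in for the fabric: the "echo" takes ~20 us against a 10 us baseline.
rtt = measure_rtt(lambda ts: None, lambda: time.sleep(20e-6))
print("throttle traffic:", congested(rtt, baseline_s=10e-6, ecn_marked=False))
```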
What is Spectrum-X?
What is NCCL (NVIDIA Collective Communications Library, aka "Nickel")?
• Multi-GPU scaling and parallelism is the purpose of NCCL:
  o Reducing the training time,
  o Scaling up to 10,000s of GPUs efficiently,
  o Maximizing inter-GPU bandwidth.
• Topology-aware and optimized to achieve high bandwidth and low latency:
  o PCIe + Ethernet / PCIe + IB
  o NVLink + Ethernet / NVLink + IB
• For the best performance: tree- or ring-style communication collectives (all-to-all / all-reduce); a minimal usage example follows
[Pipeline: topology detection → graph search → CUDA kernels]
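NCCL is usually consumed through a framework rather than called directly. As a minimal, hedged example (assuming a multi-GPU host and a torchrun launch; the file name and tensor size are arbitrary), an all-reduce over the NCCL backend in PyTorch looks roughly like this:

```python
# Minimal all-reduce over NCCL via PyTorch's torch.distributed.
# Assumes one process per GPU, launched e.g. with: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL provides the inter-GPU transport
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor filled with its rank; all_reduce sums them in place.
    x = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("every element now holds the sum of all ranks:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```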
Topologies: Understanding Rail-Optimized vs. Non-Rail-Optimized
NCCL-optimized performance with Spectrum-X
Rail-Optimized Topology:
• Defined by GPU connectivity
• NCCL-optimized topology
• Optics between leaf and servers
• Higher AI performance
• Lower latency between GPUs
• Reduces spine traffic
Non-Rail-Optimized Topology:
• Defined by physical proximity (rack)
• Cable-optimized topology
• Lower AI performance
• 3x higher switch latency between GPUs
• Higher spine congestion
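One way to see the difference in code: under a rail-optimized wiring rule, NIC/GPU k of every server connects to leaf k (its "rail"), so rank-k-to-rank-k collective traffic stays below the spine, whereas a rack-based layout puts all of a server's NICs on its rack leaf. The sketch below only illustrates that wiring rule; the server, rail, and rack counts are made-up assumptions:

```python
# Illustrative wiring rule for rail-optimized vs. non-rail-optimized leaf assignment.
# Rails-per-server and servers-per-rack are assumptions for the example.
RAILS_PER_SERVER = 8    # one NIC/GPU pair per rail
SERVERS_PER_RACK = 4

def rail_optimized_leaf(server_id, rail_id):
    """Rail-optimized: NIC 'rail_id' of every server connects to leaf 'rail_id'."""
    return f"leaf-{rail_id}"

def non_rail_leaf(server_id, rail_id):
    """Non-rail (rack-based): all NICs of a server connect to its rack's leaf."""
    return f"leaf-rack-{server_id // SERVERS_PER_RACK}"

# GPU 3 of servers 0 and 9 share a leaf only in the rail-optimized layout,
# so their collective traffic avoids an extra hop through the spine.
print(rail_optimized_leaf(0, 3), rail_optimized_leaf(9, 3))  # leaf-3 leaf-3
print(non_rail_leaf(0, 3), non_rail_leaf(9, 3))              # leaf-rack-0 leaf-rack-2
```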
What is Spectrum-X?
Adaptive Routing
• What is Adaptive Routing (AR) and why do we need it?
• Traditional data center traffic provides high entropy (randomness in IP headers that results in reasonably even ECMP load balancing)
• ECMP flow-based hashing does not work with low-entropy / high-bandwidth traffic
• AI workloads => "elephant flows" (low entropy)
• Flow collisions => congestion => higher latency/delay
• NVIDIA-developed technology
• Needed to dynamically load balance the data with packet
level granularity
• Avoid collisions, congestion and keep delay minimum and
deterministic
• Performed end to end and invisible to the AI
workloads/applications
• Direct Data Placement (DDP) on BlueField-3
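Conceptually (this is not the Spectrum-4 algorithm, just an illustration with an assumed load model), adaptive routing replaces the fixed per-flow hash with a per-packet choice of the least-loaded egress port, relying on the BlueField-3 SuperNIC's Direct Data Placement to restore packet order at the receiver:

```python
# Illustrative per-packet adaptive routing decision: send each packet on the
# least-loaded eligible uplink instead of a fixed per-flow hash.
# The port count and the queued-load model are assumptions for the example.
from collections import Counter

UPLINKS = [f"swp{i}" for i in range(1, 9)]

def adaptive_pick(port_load):
    """Choose the uplink with the smallest queued load (ties broken arbitrarily)."""
    return min(UPLINKS, key=lambda p: port_load[p])

port_load = Counter({p: 0 for p in UPLINKS})
for packet_id in range(32):      # one elephant flow, sprayed packet by packet
    port = adaptive_pick(port_load)
    port_load[port] += 1         # model the packet occupying that port's queue
    # ...the receiving BlueField-3 reorders the flow's packets (DDP)

print(dict(port_load))           # load spreads evenly across all uplinks
```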
ADAPTIVE ROUTING - FUNDAMENTALS
AR Enables Deterministic Performance at Scale
[Chart: average host throughput (Gb/s) vs. time (sec)]
Benefits:
• High throughput; short and deterministic job completion time
• Lower tail latency
• Utilizes otherwise unused bandwidth in the fabric
• Shortens AI job completion time
What is Spectrum-X?
Multitenancy
• BGP for the underlay (numbered links; a small address-plan sketch follows below)
• Overlay (VTEP placement): whenever multitenancy is required (more than one customer using the AI fabric), using an overlay is recommended; security and workload separation are key. Public cloud networks, for example, are natively multitenant.
  • Can be performed on a Spectrum switch (network overlay)
  • Can be performed on the BlueField SuperNIC (host-to-host overlay)
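For the numbered-link underlay, each leaf-spine link gets its own point-to-point subnet for its BGP session; a tiny address-plan generator is sketched below (the 10.255.0.0/16 block, device names, and the /31 convention are illustrative assumptions, not a prescribed scheme):

```python
# Sketch of a numbered-link underlay plan: one /31 per leaf-spine link for eBGP sessions.
# The address block, device names, and /31 convention are illustrative assumptions.
import ipaddress

LEAVES = [f"leaf{i}" for i in range(1, 5)]
SPINES = [f"spine{i}" for i in range(1, 3)]

p2p_block = ipaddress.ip_network("10.255.0.0/16")
subnets = p2p_block.subnets(new_prefix=31)   # iterator of /31 point-to-point subnets

plan = {}
for leaf in LEAVES:
    for spine in SPINES:
        net = next(subnets)
        leaf_ip, spine_ip = list(net)        # the two addresses in the /31
        plan[(leaf, spine)] = (f"{leaf_ip}/31", f"{spine_ip}/31")

for (leaf, spine), (l_ip, s_ip) in plan.items():
    print(f"{leaf:6} {l_ip:16} <-> {spine:7} {s_ip}")
```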
Spectrum-X Topologies
2-Tier Fabric
• If a customer is not planning (in the foreseeable future) to deploy a cluster larger than 8K GPUs, use a 2-tier fabric
• If a customer considers expanding beyond 8K GPUs but would like to start with fewer than 8K GPUs, we suggest using a 3-tier fabric (next slide)
[Diagram: spines – leaves – SU1 (256 GPU) … SU2 (256 GPU)]
Spectrum-X Topologies
3-Tier Fabric
• For anything larger than 8K GPUs (and to be future-proof), use a 3-tier fabric
[Diagram: SuperSpine planes 1–64, each with Spine 1–Spine 64 and Leaf 1–Leaf 8 per SU; SU1 (512 GPU) … SU1000 (512 GPU)]
Spectrum-X Optimizes Each Level of the Application Stack
Performance Tuning From the Hardware to the AI Model
From application-specific to general-purpose layers:
• Models: GPT, LLAMA, NEMO, and more
• AI Primitives: NCCL-optimized networking
• Network Fabric: accelerated RDMA performance
• Hardware: best-of-breed switch and SuperNIC
AI Fabric: Spectrum-X vs. Traditional Ethernet
[Diagram: two identical 2-tier fabrics – spines, leaves, SU1 (256 GPU), SU2 (256 GPU)]
Traditional Ethernet:
• Loosely coupled
• Load balancing in the fabric is handled with ECMP
• Standard CC
Spectrum-X:
• End-to-end optimized for AI performance
• Load balancing in the fabric is handled with Adaptive Routing (local and remote)
• Advanced programmable CC
Spectrum-X Performance Update
AI Cloud Workload Performance Isolation
[Charts: iteration time in seconds (lower is better) for traditional Ethernet vs. Spectrum-X vs. theoretical peak]
• AI Cloud – Nemo LLM 43B isolation (16 participating + 66 noise): Spectrum-X is 1.2X faster than traditional Ethernet
• AI Cloud – FSDP LLAMA 70B isolation (16 participating + 38 noise): Spectrum-X is 1.7X faster than traditional Ethernet
Spectrum-X accelerates iteration times for training the most common AI models such as Nemo and LLAMA. Faster training iterations lead to faster job completion times, accelerating time to insight. Spectrum-X accelerates training AI models in noisy AI Cloud environments.
Spectrum-X vs. InfiniBand
Comparison
Spectrum-X:
• Ethernet-based implementation of the InfiniBand HPC architecture and feature set, adapted to NVIDIA Spectrum switches, using RoCE
• Addresses the gap for customers requiring an Ethernet-based AI fabric
• Designed with RoCE in mind; brings configuration simplicity as well as stability
• Ideal for the AI-Cloud use case (multitenant environments)
InfiniBand:
• Lowest latency and highest performance; originally designed for HPC applications and ideal for AI workloads; provides the highest performance in the industry
• Specific upper-layer protocols that are more performant than traditional TCP/UDP and deliver superior performance with NCCL-based applications
• Credit-based flow/congestion control and Adaptive Routing are among the features native to InfiniBand
• Ideal for the AI-Factory use case
Learn More About Spectrum-X
Available Resources
NVIDIA Spectrum-X Networking Platform
• NVIDIA Spectrum-4 Ethernet Switch
• NVIDIA BlueField-3 SuperNIC
• Spectrum-X intro video
• Spectrum-X webpage
• Spectrum-X technical blog
• Datasheet
• Technical whitepaper
• Coming soon: Reference Architecture, Deployment Guide, POC Guide
Try out the new Spectrum-X NV Air Lab
• Spectrum-X – Compute Switch Configuration NV Air Lab
• https://air.nvidia.com
SN5400
64x 400GbE (50G lanes)
• NPU: NVIDIA Spectrum-4
• Switching capacity: 25.6 Tbps
• Ports: 64x 400GbE, QSFP-DD, 12W typical; 2x 25GbE, SFP28, 2.5W typical
• Timing: SyncE; Stratum-3E oscillator; PTP 1588 profiles; PPS in/out; PTP master clock source by external SFP
• System CPU: x86, hexa-core Xeon 2.2GHz; RAM: 32GB DDR4 SDRAM; image storage: 160GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot swap
• Dimensions: H: 2U, 3.43'' (87mm); W: 16.8'' (428mm); D: 28.3'' (720mm)
• Airflow: 4 fans, N+1, hot swap, reverse
• Availability: ES: February 2024; GA: April 2024
[Panel diagram: network ports, OOB, USB Type A, serial console, swappable fans, AC PSUs]
SN5600
64x 800GbE (100G lanes)
• NPU: NVIDIA Spectrum-4
• Switching capacity: 51.2 Tbps
• Ports: 64x 800GbE, OSFP, 18W typical [or 2x 400GbE per port]; 1x 25GbE, SFP28, 2.5W typical
• System CPU: x86, hexa-core Xeon 2.2GHz; RAM: 32GB DDR4 SODIMM; image storage: 160GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot swap
• Dimensions: H: 2U, 3.43'' (87mm); W: 16.8'' (428mm); D: 28.3'' (720mm)
• Airflow: 4 fans, N+1, hot swap, reverse
• Availability: ES: available; GA: December 2023
[Panel diagram: 64x 800GbE OSFP ports [or 2x 400GbE], OOB, USB Type A, serial console, swappable fans, AC PSUs]