Spectrum-X Technical Overview (Munich)

The NVIDIA Spectrum-X Networking Platform is designed to optimize AI workloads in data centers, using technologies such as Adaptive Routing and congestion control to improve performance and efficiency. It supports a scalable architecture of up to 512K GPUs, and positions Ethernet and InfiniBand for different AI cloud and AI factory environments. The platform is built around the Spectrum-4 Ethernet switch and the BlueField-3 SuperNIC, providing high bandwidth and low latency for demanding AI applications.


NVIDIA Spectrum-X Networking Platform
February 2024
The Data Center is The Computer
The network defines the data center

CLOUD (Traditional Ethernet)
• Multi-tenant
• Variety of small-scale workloads
• A traditional Ethernet network can suffice

GENERATIVE AI CLOUD (NVIDIA Spectrum-X AI Ethernet Fabric)
• Multi-tenant
• Variety of workloads, including larger-scale generative AI
• Traditional Ethernet network for North-South traffic
• NVIDIA Spectrum-X Ethernet for the AI fabric (East-West)

AI FACTORY (NVLink + InfiniBand)
• Single or few users
• Extremely large AI models
• NVIDIA NVLink and InfiniBand are the gold standard for the AI fabric

[Chart: the three regimes plotted against cluster size, from 10 to 1M+ GPUs]
NVIDIA Quantum-2 & Spectrum-4 Platform

QUANTUM-2 SWITCH (INFINIBAND)
• 64 ports of 400 Gb/s or 128 ports of 200 Gb/s
• SHARPv3 small-message data reductions
• SHARPv3 large-message data reductions
• 32x more AI acceleration engines

CONNECTX-7 (INFINIBAND/ETHERNET)
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI tag matching and all-to-all

BLUEFIELD-3 (INFINIBAND/ETHERNET)
• 16 Arm A78 64-bit cores
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI and NCCL accelerations
• Computational storage
• Security engines

SPECTRUM-4 (ETHERNET)
• 64 ports of 800 Gb/s or 128 ports of 400 Gb/s
• 100 billion transistors, TSMC 4N
• 51.2 Tb/s bandwidth, 100G SerDes
• 8K GPUs in a 2-level topology, 500K in a 3-level topology

What is Spectrum-X

• Maximizing effective bandwidth
• Optimized for AI workloads
• Optimized, consistent bandwidth under any workload placement
• Optimized, consistent bandwidth under any failure

• BlueField-3 SuperNIC
  • Programmable congestion control
  • Packet reordering at 400 Gb/s
  • Power-efficient, lean design
  • Simple, secure bare-metal deployment

• Spectrum-4
  • Fully featured 51.2 Tb/s switch
  • Single shared buffer
  • Native RoCE support (Cumulus Linux/SONiC)
  • AI-focused telemetry (NetQ)

[Diagram: software stack (DOCA services, SONiC, Cumulus, NetQ, Air; SAI/SDK, DOCA, Magnum IO) over BlueField-3 SuperNICs and a Spectrum-4 switch]
Why Spectrum-X

BlueField-3 DPU (B3220): general-purpose cloud (North-South)
BlueField-3 SuperNIC (B3140H): AI fabric (East-West)

Traditional Ethernet (North-South):
• Loosely coupled applications
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Many small flows

Spectrum-X (East-West):
• Distributed, tightly coupled processing
• RoCE (high-bandwidth flows and utilization)
• Low jitter tolerance (the long tail kills performance)
• Few large flows

[Diagram: data center with North-South traffic to loosely coupled services and East-West traffic inside the AI fabric]
NVIDIA Spectrum-4 Ethernet AI Switch

AVAILABILITY
• 100G SerDes in production
• 51.2T, high radix
• Designed for AI at scale
• Proven with GPUs
• 5th-generation ASIC

EFFICIENCY
• Power-efficient, single ASIC, no HBM
• No overheads, natively non-blocking
• Flexible and scalable leaf-spine
• Scale-out with fewer optics
• Future-proof: supports >4K 800G GPUs
• Direct Drive savings on optics and copper

PERFORMANCE
• Adaptive Routing technology
• Fine-grained load balancing
• RoCE optimized for the cloud
• Fully shared buffer (fairness)
• Programmable congestion control
• Congestion isolation
• Low latency and low tail latency
• Full-stack optimization

STANDARD
• Standard Ethernet/IP packets and control plane
• Fully disaggregated (no need for a controller)
• Open SDK
• SONiC, Cumulus, Linux Switch

Israel-1 Spectrum-X Generative AI Cloud

• NVIDIA's playground for creating test reports on real-life AI applications and benchmarks, and for creating Reference Architectures
• Generative AI cloud supercomputer based on the new Israel-developed NVIDIA Spectrum-X platform
• 256 NVIDIA HGX H100 systems, comprising 2,048 H100 GPUs in servers provided by Dell Technologies
• 2,560 BlueField-3 DPUs and 80+ Spectrum-4 Ethernet switches, all developed in Israel
• Among the world's most powerful supercomputers, and the most powerful supercomputer in Israel
• Peak AI performance of 8 exaflops

https://www.reuters.com/technology/nvidia-build-israeli-supercomputer-ai-demand-soars-2023-05-29/
ISRAEL-1 Digital Twin
Simulation on NVIDIA Air

• Allows
  • Full datapath connectivity
  • OOB configuration and ZTP can be tested
  • Building automation to configure Cumulus Linux on the switches
  • Building automation to configure Linux on the hosts
  • Validating the connectivity configuration by sending traffic from the hosts (a minimal sketch follows)
  • The topology / cabling can be modified

• Scale
  • A simulation can be duplicated for parallel testing
  • Multiple users can access the same simulation

• Next phases
  • Increase the fidelity of the simulation by running SimX-{Switch, DPU}
  • An additional storage network can be added

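As an example of the host-driven validation step, a connectivity sweep like the sketch below can run unchanged inside the Air simulation and on the real cluster. The address range is hypothetical; substitute whatever addresses the lab's ZTP automation assigns.

```python
#!/usr/bin/env python3
"""Minimal connectivity sweep for a simulated fabric (hypothetical addresses)."""
import subprocess

# Hypothetical host addresses assigned by the lab's ZTP automation.
HOSTS = [f"10.0.0.{i}" for i in range(1, 9)]

def reachable(addr: str, count: int = 1, timeout_s: int = 2) -> bool:
    """Return True if `addr` answers an ICMP echo within the timeout."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {'ok' if reachable(host) else 'FAILED'}")
```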


Agenda
• Problem definition
• What is Spectrum-X?
• Scalability
• Congestion control
• NCCL
• Topologies
• Adaptive routing
• Multitenancy
• Spectrum-X comparison
Problem Definition

• Traditional data center fabric characteristics
  • Mixed flow durations and sizes
  • High entropy; ECMP is adequate
  • Fabric utilization below 50%
• AI fabric characteristics
  • AI workloads are made of elephant flows
  • Very low entropy
  • Very high fabric utilization (90% and above)
• Verdict:
  • The tools and features used to build traditional data centers do not give good results for AI workloads (a toy illustration of the entropy problem follows below)
  • NVIDIA developed an optimized congestion control mechanism and a method to distribute traffic across ports efficiently (Adaptive Routing)
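The entropy difference can be made concrete with a toy model: ECMP picks an uplink by hashing a flow's 5-tuple, so a handful of elephant flows can collide on one uplink while others sit idle. An illustrative sketch only; Python's built-in hash stands in for the switch's header hash.

```python
"""Toy model: static ECMP hashing of flows onto uplinks.

With many small flows (high entropy) the load spreads out; with a
handful of elephant flows (low entropy) hash collisions can leave
some uplinks doubly loaded and others empty. Illustrative only.
"""
import random
from collections import Counter

UPLINKS = 8

def ecmp_port(flow_5tuple) -> int:
    # Real switches hash header fields; Python's hash stands in here.
    return hash(flow_5tuple) % UPLINKS

def load_per_uplink(num_flows: int) -> Counter:
    flows = [(f"10.0.{random.randint(0, 255)}.{random.randint(0, 255)}",
              "10.1.0.1", random.randint(1024, 65535), 4791, "UDP")
             for _ in range(num_flows)]
    return Counter(ecmp_port(f) for f in flows)

# Flows per uplink: roughly even for 1000 mice, lumpy for 8 elephants.
print("1000 small flows:", sorted(load_per_uplink(1000).values()))
print("8 elephant flows:", sorted(load_per_uplink(8).values()))
```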
What is Spectrum-X?
Scalability

• Spectrum-4 ASIC
• Tuned for RoCEv2 (focusing on AI/ML workloads), lossless
• Up to 8K endpoints with a 2-tier fabric, up to 512K with a 3-tier fabric (a back-of-the-envelope check follows)

[Diagrams: 2-tier fabric, up to 8K GPUs; 3-tier fabric, up to 512K GPUs]
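The 8K and 512K figures follow from switch radix. A back-of-the-envelope check, assuming a non-blocking fat-tree of radix-128 switches (Spectrum-4 run as 128x 400GbE); the k^2/2 and k^3/4 formulas are standard fat-tree capacity results, not NVIDIA-specific sizing.

```python
"""Back-of-the-envelope check of the 8K / 512K endpoint figures.
Assumption: non-blocking fat-tree of radix-128 switches (half of each
leaf's ports face hosts, half face the next tier)."""
RADIX = 128

two_tier = RADIX ** 2 // 2    # k^2/2 = 8,192 endpoints  (~8K GPUs)
three_tier = RADIX ** 3 // 4  # k^3/4 = 524,288 endpoints (~512K GPUs)

print(f"2-tier : {two_tier:,} endpoints")
print(f"3-tier : {three_tier:,} endpoints")
```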
What is Spectrum-X?
Congestion Control

• The AI fabric must be utilized above 90% to ensure that all available GPU resources are used and no GPU sits idle
• DCQCN (Data Center Quantized Congestion Notification) is an end-to-end congestion control method for RoCEv2
  • Relies on ECN (Explicit Congestion Notification)
  • Responds to congestion notifications and dynamically adjusts traffic transmit rates
  • Industry standard and the de facto congestion control standard for data centers running AI/ML workloads (a simplified sketch of the rate-control loop follows)
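The DCQCN reaction point can be sketched in a few lines: the sender keeps an estimate of path congestion driven by ECN-based Congestion Notification Packets (CNPs), cuts its rate multiplicatively, and recovers when notifications stop. A simplified illustration with textbook constants, not NVIDIA's implementation.

```python
"""Simplified DCQCN-style reaction point (illustrative constants).
The sender keeps a running congestion estimate `alpha` driven by CNP
arrivals, cuts its rate when CNPs arrive, and recovers otherwise."""

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps    # current send rate
        self.target = line_rate_gbps  # rate to recover toward
        self.alpha = 1.0              # congestion estimate in [0, 1]
        self.g = 1.0 / 256            # alpha smoothing gain

    def on_cnp(self) -> None:
        """Congestion Notification Packet: multiplicative decrease."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNPs seen lately: decay alpha and recover toward target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2  # fast-recovery step
```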
What is Spectrum-X?
Congestion Control – NVIDIA Spectrum-X

• ZTR-CC (Zero Touch RoCE Congestion Control) is an NVIDIA-developed technology that allows zero-touch RoCE deployment on switches and provides Round-Trip Time (RTT) based congestion control
• Timing packets are periodically sent by the sender and returned immediately by the receiver; measuring the RTT lets the sender estimate path congestion and throttle its transmit rate (sketched below)
• Spectrum-X uses ZTR-CC
• ZTR-CC = RTT + ECN
  • RTT probes measure the queuing delay along the path
  • ECN marking increases the fidelity of the congestion-control signaling
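To make the RTT mechanism concrete, here is a minimal sketch of a probe-driven throttling loop. The baseline RTT, threshold, and rate rules are illustrative assumptions, not the shipped ZTR-CC algorithm.

```python
"""Illustrative RTT-probe congestion loop (not the shipped algorithm).
The sender timestamps a probe, the receiver echoes it immediately, and
the sender throttles when the measured RTT exceeds a baseline."""
import time

BASE_RTT_US = 10.0       # assumed uncongested fabric RTT (microseconds)
THROTTLE_FACTOR = 0.8    # illustrative multiplicative decrease
LINE_RATE_GBPS = 400.0

def on_probe_ack(send_ts_us: float, rate_gbps: float) -> float:
    """Update the send rate when a probe's echo arrives."""
    rtt_us = time.monotonic() * 1e6 - send_ts_us
    if rtt_us > 2 * BASE_RTT_US:               # queuing delay detected
        return rate_gbps * THROTTLE_FACTOR     # back off
    return min(rate_gbps * 1.05, LINE_RATE_GBPS)  # gently probe upward
```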
What is Spectrum-X?
What is NCCL (NVIDIA Collective Communications Library, aka "Nickel")?

• Multi-GPU scaling and parallelism is the purpose of NCCL
  o Reducing training time
  o Scaling efficiently to 10,000s of GPUs
  o Maximizing inter-GPU bandwidth
• Topology-aware and optimized to achieve high bandwidth and low latency
  o PCIe + Ethernet / PCIe + InfiniBand
  o NVLink + Ethernet / NVLink + InfiniBand
• For the best performance: tree- or ring-style communication collectives (all-to-all / all-reduce); a minimal usage example follows

[Pipeline: topology detection, graph search, CUDA kernels]
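In practice these collectives are usually invoked through a framework rather than the raw NCCL C API. A minimal sketch using PyTorch's NCCL backend, assuming one GPU per process and a torchrun launch:

```python
"""Minimal NCCL all-reduce via PyTorch's distributed package.
Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py"""
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # NCCL picks rings/trees
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a tensor; every rank receives the sum.
    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("after all-reduce:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```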


Topologies: Understanding Rail-Optimized vs. Non-Rail-Optimized
NCCL-optimized performance with Spectrum-X

Rail-Optimized Topology:
• Defined by GPU connectivity
• NCCL-optimized topology
• Optics between leaf switches and servers
• Higher AI performance
• Lower latency between GPUs
• Reduced spine traffic

Non-Rail-Optimized Topology:
• Defined by physical proximity (rack)
• Cable-optimized topology
• Lower AI performance
• 3x higher switch latency between GPUs
• Higher spine congestion

(A toy cabling map contrasting the two schemes follows below.)
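The rail idea reduces to a cabling rule: GPU k of every server attaches to leaf k ("rail" k), so the same-index GPUs that NCCL rings pair up are one leaf hop apart. A toy mapping, assuming 8 GPUs per server and 4 servers per rack (both numbers illustrative):

```python
"""Toy cabling map: rail-optimized vs. rack-proximity leaf assignment.
Assumes 8 GPUs per server and one leaf switch per rail."""
GPUS_PER_SERVER = 8
SERVERS_PER_RACK = 4  # illustrative

def rail_leaf(server: int, gpu: int) -> str:
    # Rail-optimized: the leaf depends only on the GPU index (the rail),
    # so traffic between same-index GPUs never crosses a spine.
    return f"leaf-{gpu}"

def rack_leaf(server: int, gpu: int) -> str:
    # Non-rail: the leaf depends only on the server's rack, so traffic
    # between same-index GPUs often takes leaf -> spine -> leaf hops.
    return f"leaf-{server // SERVERS_PER_RACK}"

for s in range(2):
    for g in range(GPUS_PER_SERVER):
        print(f"server{s}/gpu{g}: rail={rail_leaf(s, g)}  rack={rack_leaf(s, g)}")
```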
What is Spectrum-X?
Adaptive Routing

• What is Adaptive Routing (AR) and why do we need it?
  • Traditional data center traffic provides high entropy (randomness in the IP headers that results in fairly even ECMP load balancing)
  • ECMP flow-based hashing does not work with low-entropy, high-bandwidth traffic
  • AI workloads => "elephant flows" (low entropy)
  • Flow collisions => congestion => higher latency/delay
• NVIDIA-developed technology
  • Dynamically load-balances the data with packet-level granularity
  • Avoids collisions and congestion, keeping delay minimal and deterministic
  • Performed end to end, invisible to the AI workloads/applications
  • Direct Data Placement (DDP) on BlueField-3 restores packet order at the receiver (a toy illustration follows)
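To make the packet-level idea concrete: AR sprays the packets of a single flow across all uplinks, accepting out-of-order arrival, and DDP places each packet by its sequence number at the receiver. A toy software illustration, not the hardware mechanism:

```python
"""Toy spray-and-reassemble illustration of AR + DDP (not hardware code).
Packets of one elephant flow are sprayed across uplinks, arrive out of
order, and the receiver places each one by its sequence number."""
import random

UPLINKS = 4
NUM_PACKETS = 8

def spray(num_packets: int):
    """Spray packets round-robin across uplinks (real AR picks the
    least-congested port per packet)."""
    return [(seq, seq % UPLINKS) for seq in range(num_packets)]

def deliver_out_of_order(packets):
    packets = packets[:]
    random.shuffle(packets)  # paths have different latencies
    return packets

received = deliver_out_of_order(spray(NUM_PACKETS))
buffer = [None] * NUM_PACKETS
for seq, _port in received:
    buffer[seq] = f"payload-{seq}"  # direct placement by sequence number

assert all(b is not None for b in buffer)
print("in order after placement:", buffer)
```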
Adaptive Routing - Fundamentals
AR Enables Deterministic Performance at Scale

Benefits:
• High throughput; short and deterministic job completion time
• Lower tail latency
• Utilizes otherwise unused bandwidth in the fabric
• Shortens AI job completion time

[Chart: average host throughput, bandwidth (Gb/s) vs. time (s)]


What is Spectrum-X?
Multitenancy

• BGP for the underlay (numbered links)
• Overlay (VTEP placement): whenever multitenancy is required (more than one tenant using the AI fabric), using an overlay is recommended, since security and workload separation are key; public cloud networks, for example, are natively multi-tenant
  • Can be performed on a Spectrum switch (network overlay)
  • Can be performed on the BlueField SuperNIC (host-to-host overlay)

(A sketch of the per-tenant encapsulation follows below.)
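Overlay separation amounts to tagging each tenant's traffic with its own VXLAN Network Identifier (VNI). A scapy sketch of the encapsulation; the addresses and VNIs are hypothetical, and in Spectrum-X the VTEP performs this in switch or SuperNIC hardware rather than in software:

```python
"""VXLAN encapsulation sketch with scapy: one VNI per tenant.
Addresses and VNIs are hypothetical; real VTEPs do this in hardware."""
from scapy.all import Ether, IP, UDP
from scapy.layers.vxlan import VXLAN

TENANT_VNI = {"tenant-a": 10100, "tenant-b": 10200}  # hypothetical VNIs

def encapsulate(tenant: str, inner: Ether) -> Ether:
    # Outer header runs between VTEP addresses; the inner frame is
    # untouched, so overlapping tenant IP ranges never collide.
    return (Ether()
            / IP(src="10.255.0.1", dst="10.255.0.2")
            / UDP(dport=4789)                 # IANA VXLAN port
            / VXLAN(vni=TENANT_VNI[tenant])
            / inner)

inner = Ether() / IP(src="192.168.0.1", dst="192.168.0.2")
print(encapsulate("tenant-a", inner).summary())
```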
Spectrum-X Topologies
2-Tier Fabric

• If a customer is not planning (in the foreseeable future) to deploy a cluster larger than 8K GPUs, use a 2-tier fabric
• If a customer considers expanding beyond 8K GPUs but would like to start with fewer than 8K GPUs, we suggest using a 3-tier fabric from the start (next slide)

[Diagram: spine layer over leaf layer, with scalable units SU1 (256 GPUs), SU2 (256 GPUs), ...]

Spectrum-X Topologies
3-Tier Fabric

• For anything larger than 8K GPUs (and to be future-proof), use a 3-tier fabric

[Diagram: 64 SuperSpine planes (SuperSpine 1-64) above Spine 1-64 per pod; each pod has Leaf 1-8; scalable units SU1 through SU1000 of 512 GPUs each]
Spectrum-X Optimizes Each Level of the Application Stack
Performance Tuning From the Hardware to the AI Model

• Models (application-specific): GPT, LLaMA, NeMo, and more
• AI primitives: NCCL-optimized networking
• Network fabric: accelerated RDMA performance
• Hardware: best-of-breed switch and SuperNIC
AI Fabric: Spectrum-X vs. Traditional Ethernet

Traditional Ethernet (general purpose):
• Loosely coupled
• Load balancing in the fabric is handled with ECMP
• Standard congestion control

Spectrum-X:
• End-to-end optimized for AI performance
• Load balancing in the fabric is handled with Adaptive Routing (local and remote)
• Advanced programmable congestion control

[Diagrams: identical spine/leaf fabrics with SU1 (256 GPUs) and SU2 (256 GPUs) shown for each approach]
Spectrum-X Performance Update
AI Cloud Workload Performance Isolation

[Charts: training iteration time in seconds (lower is better) for traditional Ethernet vs. Spectrum-X vs. theoretical peak]
• AI Cloud, NeMo LLM 43B isolation (16 participating + 66 noise jobs): Spectrum-X is 1.2x faster than traditional Ethernet
• AI Cloud, FSDP LLaMA 70B isolation (16 participating + 38 noise jobs): Spectrum-X is 1.7x faster than traditional Ethernet

Spectrum-X accelerates iteration times for training the most common AI models, such as NeMo and LLaMA. Faster training iterations lead to faster job completion times, accelerating time to insight. Spectrum-X accelerates AI model training even in noisy AI cloud environments.


Spectrum-X vs. InfiniBand
Comparison

Spectrum-X:
• Ethernet-based implementation of the InfiniBand HPC architecture and feature set, adapted to NVIDIA Spectrum switches using RoCE
• Addresses the gap for customers requiring an Ethernet-based AI fabric
• Designed with RoCE in mind; brings configuration simplicity as well as stability
• Ideal for the AI cloud use case (multi-tenant environments)

InfiniBand:
• Lowest latency, highest performance; originally designed for HPC applications and ideal for AI workloads; provides the highest performance in the industry
• Specific upper-layer protocols that outperform traditional TCP/UDP and excel with NCCL-based applications
• Credit-based flow/congestion control and Adaptive Routing are among the features native to InfiniBand
• Ideal for the AI factory use case
Learn More About Spectrum-X
Available Resources

NVIDIA Spectrum-X Networking Platform:
• NVIDIA Spectrum-4 Ethernet Switch
• NVIDIA BlueField-3 SuperNIC
• Spectrum-X intro video, webpage, and technical blog
• Spectrum-X datasheet and technical whitepaper
• Coming soon: Reference Architecture, Deployment Guide, and POC Guide

Try out the new Spectrum-X NV Air lab, "Spectrum-X – Compute Switch Configuration": https://air.nvidia.com

SN5400: 64x 400GbE (50G lanes)

• NPU: NVIDIA Spectrum-4
• Switching capacity: 25.6 Tb/s
• Ports: 64x 400GbE QSFP-DD, 12 W typical; 2x 25GbE SFP28, 2.5 W typical
• Timing: SyncE; Stratum-3E oscillator; PTP 1588 profiles; PPS in/out; PTP master clock source via external SFP
• System CPU: x86 hexa-core Xeon, 2.2 GHz; RAM: 32 GB DDR4 SDRAM; image storage: 160 GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot-swap
• Dimensions: H 2U, 3.43" (87 mm); W 16.8" (428 mm); D 28.3" (720 mm)
• Airflow: 4 fans, N+1, hot-swap, reverse
• Availability: ES February 2024; GA April 2024

[Panel view: front ports, USB Type-A, OOB management, serial console, redundant AC PSUs, swappable fans]
SN5600: 64x 800GbE (100G lanes)

• NPU: NVIDIA Spectrum-4
• Switching capacity: 51.2 Tb/s
• Ports: 64x 800GbE OSFP (each usable as 2x 400GbE), 18 W typical; 1x 25GbE SFP28, 2.5 W typical
• System CPU: x86 hexa-core Xeon, 2.2 GHz; RAM: 32 GB DDR4 SODIMM; image storage: 160 GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot-swap
• Dimensions: H 2U, 3.43" (87 mm); W 16.8" (428 mm); D 28.3" (720 mm)
• Airflow: 4 fans, N+1, hot-swap, reverse
• Availability: ES: available; GA: December 2023

[Panel view: 800 GbE OSFP ports, USB Type-A, OOB management, serial console, redundant AC PSUs, swappable fans]
