NVIDIA Spectrum-X Networking
Platform
February 2024
The Data Center is The Computer
The network defines the data center
[Chart: fabric choice vs. cluster size, from 10 to 1M+ GPUs in the cluster]
Cloud (Traditional Ethernet)
• Multi-tenant
• Variety of small-scale workloads
• Traditional Ethernet network can suffice
AI Cloud (NVIDIA Spectrum-X AI Ethernet Fabric)
• Generative AI cloud
• Multi-tenant
• Variety of workloads, including larger-scale generative AI
• Traditional Ethernet network for North-South traffic
• NVIDIA Spectrum-X Ethernet for the AI fabric (East-West)
AI Factory (NVLink + InfiniBand)
• Single or few users
• Extremely large AI models
• NVIDIA NVLink and InfiniBand: the gold standard for the AI fabric
NVIDIA Quantum-2 & Spectrum-4 Platform
QUANTUM-2 SWITCH (INFINIBAND)
• 64 ports of 400 Gbps or 128 ports of 200 Gbps
• SHARPv3 small message data reductions
• SHARPv3 large message data reductions
• 32X more AI acceleration engines
CONNECTX-7 (INFINIBAND/ETHERNET)
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI Tag Matching and All-to-All
BLUEFIELD-3 (INFINIBAND/ETHERNET)
• 16 Arm A78 64-bit cores
• 16-core / 256-thread datapath accelerator
• Full transport offload and telemetry
• Hardware-based RDMA / GPUDirect
• MPI and NCCL accelerations
• Computational storage
• Security engines
SPECTRUM-4 (ETHERNET)
• 64 ports of 800 Gbps or 128 ports of 400 Gbps
• 100 billion transistors, TSMC 4N
• 51.2T bandwidth, 100G SerDes
• 8K GPUs in a 2-level fabric, 500K in a 3-level fabric
What is Spectrum-X
• Maximizing effective bandwidth
  • Optimized for AI workloads
  • Optimized and consistent bandwidth with any workload placement
  • Optimized and consistent bandwidth upon any failure
• BlueField-3 SuperNIC
  • Programmable congestion control
  • Packet reordering at 400 Gb/s
  • Power-efficient, lean design
  • Simple, secured bare metal
• Spectrum-4
  • Fully featured 51.2 Tb/s switch
  • Single shared buffer
  • Native RoCE support (Cumulus Linux/SONiC)
  • AI-focused telemetry (NetQ)
[Diagram: Spectrum-X stack – Spectrum-4 switch and BlueField-3 SuperNICs, DOCA, SAI/SPSDK, Magnum IO, and SONiC / Cumulus / NetQ / Air services]
Why Spectrum-X
General Purpose Cloud (North-South) – Traditional Ethernet, BlueField-3 DPU (B3220)
• Loosely coupled applications
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Many small flows
AI Fabric (East-West) – Spectrum-X, BlueField-3 SuperNIC (B3140H)
• Distributed, tightly coupled processing
• RoCE (high-bandwidth flows and utilization)
• Low jitter tolerance (long tail kills performance)
• Few large flows
NVIDIA Spectrum-4 Ethernet AI Switch
AVAILABILITY
• 100G SerDes in production
• 51.2T, high radix
• Designed for AI at scale
• Proven with GPUs
• 5th-gen ASIC
EFFICIENCY
• Power efficient, single ASIC, no HBM
• No overheads, natively non-blocking
• Flexible and scalable leaf-spine
• Scale-out with fewer optics
• Future proof: supports >4K 800G GPUs
• Direct Drive savings – optics and copper
PERFORMANCE
• Adaptive Routing technology
• Fine-grain load balancing
• RoCE optimized for cloud
• Fully shared buffer – fairness
• Programmable congestion control
• Congestion isolation
• Low latency and low tail latency
STANDARD
• Standard Ethernet/IP packets and control plane
• Fully disaggregated
• No need for a controller
• Open SDK
• SONiC, Cumulus, Linux Switch
Full Stack Optimization
Israel-1 Spectrum-X Generative AI Cloud
• NVIDIA playground to create test reports on real-life AI applications and benchmarks, and to create Reference Architectures
• Generative AI cloud supercomputer based on the new Israel-developed NVIDIA Spectrum-X platform
• 256 NVIDIA HGX H100 systems, including 2,048 H100 GPUs, in servers provided by Dell Technologies
• 2,560 BlueField-3 DPUs and 80+ Spectrum-4 Ethernet switches, all developed in Israel
• Among the world's most powerful supercomputers, and the most powerful supercomputer in Israel
• Peak AI performance of 8 exaflops
https://www.reuters.com/technology/nvidia-build-israeli-supercomputer-ai-demand-soars-2023-05-29/
ISRAEL-1 Digital Twin
Simulation on NVIDIA Air
• Allows:
  • Full datapath connectivity
  • OOB configuration and ZTP can be tested
  • Build automations to configure Cumulus Linux
  • Build automations to configure the hosts' Linux
  • Validate the connectivity configuration by sending traffic from the hosts
  • Topology / cabling can be modified
• Scale:
  • The simulation can be duplicated for parallel testing
  • Multiple users can access the same simulation
• Next phases:
  • Increase the level of simulation by running SimX-{Switch, DPU}
  • An additional storage network can be added
Agenda
• Problem definition
• What is Spectrum-X?
• Scalability
• Congestion control
• NCCL
• Topologies
• Adaptive routing
• Multitenancy
• Spectrum-X comparison
Problem Definition
• Traditional data center fabric characteristics
• Mixed flow duration and size
• High entropy, ECMP is adequate
• Fabric utilization <50%
• AI Fabric characteristics
• AI Workloads are made of elephant flows
• Very low entropy
• Very high fabric utilization (90% and above)
• Verdict:
• Tools & features used to build traditional data centers do
not give good results for AI workloads
• NVIDIA developed an optimized congestion control mechanism and a method to distribute traffic across ports efficiently (Adaptive Routing); the sketch below illustrates the underlying ECMP problem
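To make the entropy point concrete, here is a minimal, purely illustrative Python sketch (not NVIDIA tooling; the flow tuples, uplink count, and use of Python's built-in hash are assumptions) contrasting how many small flows and a few elephant flows land on uplinks under static ECMP-style hashing:

```python
# Illustrative only: why static per-flow ECMP hashing struggles with elephant flows.
# Flow tuples, the uplink count, and Python's hash() as the ECMP hash are assumptions.
from collections import Counter

NUM_UPLINKS = 8

def ecmp_uplink(flow_tuple):
    """Static ECMP: every packet of a flow lands on the same uplink."""
    return hash(flow_tuple) % NUM_UPLINKS

# Traditional traffic: many small flows -> high entropy, load spreads out evenly.
many_small = [("10.0.0.%d" % (i % 250), "10.0.1.%d" % (i % 250), 1000 + i)
              for i in range(10_000)]
small_load = Counter(ecmp_uplink(f) for f in many_small)

# AI traffic: a few elephant flows -> low entropy; collisions typically leave some
# uplinks overloaded while others sit idle.
elephants = [("10.1.0.1", "10.1.1.1", 4791 + i) for i in range(8)]
elephant_load = Counter(ecmp_uplink(f) for f in elephants)

print("many small flows per uplink:", dict(small_load))
print("elephant flows per uplink:  ", dict(elephant_load))
```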
What is Spectrum-X?
Scalability
• Spectrum-4 ASIC
• Tuned for RoCEv2 (focusing on AI/ML workloads), lossless
• Up to 8K endpoints with a 2-tier fabric, up to 512K with a 3-tier fabric (see the radix arithmetic sketched below)
[Diagrams: 2-tier fabric – up to 8K GPUs; 3-tier fabric – up to 512K GPUs]
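The 8K and 512K figures fall out of the Spectrum-4 radix; here is a back-of-the-envelope check, assuming 128 ports of 400GbE per switch and a non-blocking half-down/half-up split (a simplification for illustration, not a deployment rule):

```python
# Back-of-the-envelope endpoint counts for 2-tier and 3-tier Spectrum-4 fabrics.
# Assumes 128x 400GbE ports per switch and a non-blocking half-down/half-up split.
RADIX = 128          # 400GbE ports per Spectrum-4 switch
DOWN = RADIX // 2    # ports facing the tier below (hosts, for a leaf)

# 2-tier: each leaf serves DOWN hosts; the spine radix bounds the number of leaves.
endpoints_2_tier = RADIX * DOWN               # 128 leaves * 64 hosts = 8,192  (~8K)

# 3-tier: a pod is bounded by a spine's DOWN ports (64 leaves), and the
# super-spine radix bounds the number of pods (128).
endpoints_per_pod = DOWN * DOWN               # 64 leaves * 64 hosts = 4,096
endpoints_3_tier = RADIX * endpoints_per_pod  # 128 pods * 4,096 = 524,288 (~512K)

print(endpoints_2_tier, endpoints_3_tier)     # 8192 524288
```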
What is Spectrum-X?
Congestion Control
• The AI fabric must be utilized above 90% to ensure that all available GPU resources are used and no GPU sits idle
• DCQCN (Data Center Quantized Congestion
Notification) is an end-to-end congestion
control method for RoCEv2
• Relies on ECN (Explicit Congestion
Notification)
• Responds to congestion notifications and
dynamically adjusts traffic transmit rates
• The industry standard and de facto congestion control method for data centers running AI/ML workloads (a simplified sender-side sketch follows below)
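As a rough mental model only (not the DCQCN specification and not NVIDIA's implementation), the sender side can be pictured as a rate controller that backs off when ECN-driven congestion notification packets (CNPs) arrive and recovers otherwise; every constant below is an illustrative assumption:

```python
# Illustrative DCQCN-style sender behavior: multiplicative decrease when an
# ECN-driven congestion notification (CNP) arrives, gradual recovery otherwise.
# The constants and the congestion trace are made up for demonstration.
LINE_RATE_GBPS = 400.0

def adjust_rate(rate_gbps, cnp_received, alpha=0.5, recovery_step_gbps=10.0):
    """Return the new transmit rate after one control interval."""
    if cnp_received:  # a switch marked ECN, so the receiver sent a CNP
        return max(rate_gbps * (1.0 - alpha / 2.0), 1.0)
    return min(rate_gbps + recovery_step_gbps, LINE_RATE_GBPS)

rate = LINE_RATE_GBPS
for cnp in [False, True, True, False, False, False, True, False]:
    rate = adjust_rate(rate, cnp)
    print(f"cnp={cnp!s:<5} rate={rate:6.1f} Gb/s")
```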
What is Spectrum-X?
Congestion Control – NVIDIA Spectrum-X
• ZTR-CC (Zero Touch RoCE – Congestion Control) is an NVIDIA-developed technology that allows zero-touch RoCE deployment on switches and provides Round Trip Time (RTT) based congestion control
• Timing packets are periodically sent by the sender and returned immediately by the receiver; this allows the RTT to be measured and used to estimate congestion and throttle the transmitted traffic (see the sketch below)
• Spectrum-X uses ZTR-CC
• ZTR-CC: RTT + ECN
  • RTT measures the round-trip delay through the fabric
  • ECN marking increases the fidelity of CC signaling
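The RTT half of that signal can be pictured as a probe loop: the sender timestamps a timing packet, the receiver reflects it immediately, and RTT inflation (combined with ECN marks) drives throttling. The sketch below is a simplified illustration under assumed thresholds, not NVIDIA's ZTR-CC algorithm; the probe/echo hooks are placeholders:

```python
# Simplified illustration of RTT-probe congestion detection (not the ZTR-CC implementation).
# send_probe/await_echo are placeholder hooks; the baseline and threshold are assumptions.
import time

def measure_rtt(send_probe, await_echo):
    """Timestamp a timing packet, wait for the receiver's immediate echo, return RTT."""
    t_sent = time.monotonic()
    send_probe(t_sent)
    await_echo()
    return time.monotonic() - t_sent

def congested(rtt_s, baseline_s, ecn_marked, inflation_threshold=1.5):
    """Combine RTT inflation with ECN marks, mirroring the 'RTT + ECN' idea."""
    return (rtt_s > inflation_threshold * baseline_s) or ecn_marked

# Dummy hooks standing in for the fabric: the "echo" takes ~20 us against a 10 us baseline.
rtt = measure_rtt(lambda ts: None, lambda: time.sleep(20e-6))
print("throttle traffic:", congested(rtt, baseline_s=10e-6, ecn_marked=False))
```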
What is Spectrum-X?
What is NCCL (NVIDIA Collective Communications Library, aka "Nickel")?
• Multi-GPU scaling and parallelism is the purpose of NCCL:
  o Reducing the training time,
  o Scaling up to 10,000s of GPUs efficiently,
  o Maximizing inter-GPU bandwidth.
• Topology-aware and optimized to achieve high bandwidth and low latency:
  o PCIe + Ethernet / PCIe + IB
  o NVLink + Ethernet / NVLink + IB
• For the best performance: tree- or ring-style communication collectives (all-to-all / all-reduce); a minimal usage example follows
[Pipeline: topology detection → graph search → CUDA kernels]
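NCCL is usually consumed through a framework rather than called directly. As a minimal, hedged example (assuming a multi-GPU host and a torchrun launch; the file name and tensor size are arbitrary), an all-reduce over the NCCL backend in PyTorch looks roughly like this:

```python
# Minimal all-reduce over NCCL via PyTorch's torch.distributed.
# Assumes one process per GPU, launched e.g. with: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL provides the inter-GPU transport
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor filled with its rank; all_reduce sums them in place.
    x = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("every element now holds the sum of all ranks:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```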
Topologies: Understanding Rail-Optimized vs. Non-Rail-Optimized
NCCL-optimized performance with Spectrum-X
Rail-Optimized Topology:
• Defined by GPU connectivity
• NCCL-optimized topology
• Optics between leaf and servers
• Higher AI performance
• Lower latency between GPUs
• Reduces spine traffic
Non-Rail-Optimized Topology:
• Defined by physical proximity (rack)
• Cable-optimized topology
• Lower AI performance
• 3x higher switch latency between GPUs
• Higher spine congestion
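One way to see the difference in code: under a rail-optimized wiring rule, NIC/GPU k of every server connects to leaf k (its "rail"), so rank-k-to-rank-k collective traffic stays below the spine, whereas a rack-based layout puts all of a server's NICs on its rack leaf. The sketch below only illustrates that wiring rule; the server, rail, and rack counts are made-up assumptions:

```python
# Illustrative wiring rule for rail-optimized vs. non-rail-optimized leaf assignment.
# Rails-per-server and servers-per-rack are assumptions for the example.
RAILS_PER_SERVER = 8    # one NIC/GPU pair per rail
SERVERS_PER_RACK = 4

def rail_optimized_leaf(server_id, rail_id):
    """Rail-optimized: NIC 'rail_id' of every server connects to leaf 'rail_id'."""
    return f"leaf-{rail_id}"

def non_rail_leaf(server_id, rail_id):
    """Non-rail (rack-based): all NICs of a server connect to its rack's leaf."""
    return f"leaf-rack-{server_id // SERVERS_PER_RACK}"

# GPU 3 of servers 0 and 9 share a leaf only in the rail-optimized layout,
# so their collective traffic avoids an extra hop through the spine.
print(rail_optimized_leaf(0, 3), rail_optimized_leaf(9, 3))  # leaf-3 leaf-3
print(non_rail_leaf(0, 3), non_rail_leaf(9, 3))              # leaf-rack-0 leaf-rack-2
```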
What is Spectrum-X?
Adaptive Routing
• What is Adaptive Routing (AR) and why do we need it?
• Traditional data center traffic provides high entropy (randomness in IP headers that results in reasonably even ECMP load balancing)
• ECMP flow-based hashing does not work with low-entropy / high-bandwidth traffic
• AI workloads => "elephant flows" (low entropy)
• Flow collisions => congestion => higher latency/delay
• NVIDIA-developed technology
• Needed to dynamically load balance the data with packet
level granularity
• Avoid collisions, congestion and keep delay minimum and
deterministic
• Performed end to end and invisible to the AI
workloads/applications
• Direct Data Placement (DDP) on BlueField-3
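Conceptually (this is not the Spectrum-4 algorithm, just an illustration with an assumed load model), adaptive routing replaces the fixed per-flow hash with a per-packet choice of the least-loaded egress port, relying on the BlueField-3 SuperNIC's Direct Data Placement to restore packet order at the receiver:

```python
# Illustrative per-packet adaptive routing decision: send each packet on the
# least-loaded eligible uplink instead of a fixed per-flow hash.
# The port count and the queued-load model are assumptions for the example.
from collections import Counter

UPLINKS = [f"swp{i}" for i in range(1, 9)]

def adaptive_pick(port_load):
    """Choose the uplink with the smallest queued load (ties broken arbitrarily)."""
    return min(UPLINKS, key=lambda p: port_load[p])

port_load = Counter({p: 0 for p in UPLINKS})
for packet_id in range(32):      # one elephant flow, sprayed packet by packet
    port = adaptive_pick(port_load)
    port_load[port] += 1         # model the packet occupying that port's queue
    # ...the receiving BlueField-3 reorders the flow's packets (DDP)

print(dict(port_load))           # load spreads evenly across all uplinks
```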
ADAPTIVE ROUTING - FUNDAMENTALS
AR Enables Deterministic Performance at Scale
[Chart: average host throughput (Gb/s) vs. time (sec)]
Benefits:
• High throughput; short and deterministic job completion time
• Lower tail latency
• Utilizes otherwise unused bandwidth in the fabric
• Shortens AI job completion time
What is Spectrum-X?
Multitenancy
• BGP for the underlay (numbered links; a small address-plan sketch follows below)
• Overlay (VTEP placement): whenever multitenancy is required (more than one customer using the AI fabric), using an overlay is recommended; security and workload separation are key. Public cloud networks, for example, are natively multitenant.
  • Can be performed on a Spectrum switch (network overlay)
  • Can be performed on the BlueField SuperNIC (host-to-host overlay)
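For the numbered-link underlay, each leaf-spine link gets its own point-to-point subnet for its BGP session; a tiny address-plan generator is sketched below (the 10.255.0.0/16 block, device names, and the /31 convention are illustrative assumptions, not a prescribed scheme):

```python
# Sketch of a numbered-link underlay plan: one /31 per leaf-spine link for eBGP sessions.
# The address block, device names, and /31 convention are illustrative assumptions.
import ipaddress

LEAVES = [f"leaf{i}" for i in range(1, 5)]
SPINES = [f"spine{i}" for i in range(1, 3)]

p2p_block = ipaddress.ip_network("10.255.0.0/16")
subnets = p2p_block.subnets(new_prefix=31)   # iterator of /31 point-to-point subnets

plan = {}
for leaf in LEAVES:
    for spine in SPINES:
        net = next(subnets)
        leaf_ip, spine_ip = list(net)        # the two addresses in the /31
        plan[(leaf, spine)] = (f"{leaf_ip}/31", f"{spine_ip}/31")

for (leaf, spine), (l_ip, s_ip) in plan.items():
    print(f"{leaf:6} {l_ip:16} <-> {spine:7} {s_ip}")
```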
Spectrum-X Topologies
2-Tier Fabric
• If a customer is not planning (in the foreseeable future) to deploy a cluster larger than 8K GPUs, use a 2-tier fabric
• If a customer considers expanding beyond 8K GPUs but would like to start with fewer than 8K GPUs, we suggest using a 3-tier fabric (next slide)
[Diagram: spines – leaves – SU1 (256 GPU) … SU2 (256 GPU)]
Spectrum-X Topologies
3-Tier Fabric
• For anything larger than 8K GPUs (and to be future-proof), use a 3-tier fabric
[Diagram: SuperSpine planes 1–64, each with Spine 1–Spine 64 and Leaf 1–Leaf 8 per SU; SU1 (512 GPU) … SU1000 (512 GPU)]
Spectrum-X Optimizes Each Level of the Application Stack
Performance Tuning From the Hardware to the AI Model
From application-specific to general-purpose layers:
• Models: GPT, LLAMA, NEMO, and more
• AI Primitives: NCCL-optimized networking
• Network Fabric: accelerated RDMA performance
• Hardware: best-of-breed switch and SuperNIC
AI Fabric: Spectrum-X vs. Traditional Ethernet
[Diagram: two identical 2-tier fabrics – spines, leaves, SU1 (256 GPU), SU2 (256 GPU)]
Traditional Ethernet:
• Loosely coupled
• Load balancing in the fabric is handled with ECMP
• Standard CC
Spectrum-X:
• End-to-end optimized for AI performance
• Load balancing in the fabric is handled with Adaptive Routing (local and remote)
• Advanced programmable CC
Spectrum-X Performance Update
AI Cloud Workload Performance Isolation
[Charts: iteration time in seconds (lower is better) for traditional Ethernet vs. Spectrum-X vs. theoretical peak]
• AI Cloud – Nemo LLM 43B isolation (16 participating + 66 noise): Spectrum-X is 1.2X faster than traditional Ethernet
• AI Cloud – FSDP LLAMA 70B isolation (16 participating + 38 noise): Spectrum-X is 1.7X faster than traditional Ethernet
Spectrum-X accelerates iteration times for training the most common AI models such as Nemo and LLAMA. Faster training iterations lead to faster job completion times, accelerating time to insight. Spectrum-X accelerates training AI models in noisy AI Cloud environments.
Spectrum-X vs. InfiniBand
Comparison
Spectrum-X:
• Ethernet-based implementation of the InfiniBand HPC architecture and feature set, adapted to NVIDIA Spectrum switches, using RoCE
• Addresses the gap for customers requiring an Ethernet-based AI fabric
• Designed with RoCE in mind; brings configuration simplicity as well as stability
• Ideal for the AI-Cloud use case (multitenant environments)
InfiniBand:
• Lowest latency and highest performance; originally designed for HPC applications and ideal for AI workloads; provides the highest performance in the industry
• Specific upper-layer protocols that are more performant than traditional TCP/UDP and deliver superior performance with NCCL-based applications
• Credit-based flow/congestion control and Adaptive Routing are among the features native to InfiniBand
• Ideal for the AI-Factory use case
Learn More About Spectrum-X
Available Resources
NVIDIA Spectrum-X Networking Platform
• NVIDIA Spectrum-4 Ethernet Switch
• NVIDIA BlueField-3 SuperNIC
• Spectrum-X intro video
• Spectrum-X webpage
• Spectrum-X technical blog
• Datasheet
• Technical whitepaper
• Coming soon: Reference Architecture, Deployment Guide, POC Guide
Try out the new Spectrum-X NV Air Lab
• Spectrum-X – Compute Switch Configuration NV Air Lab
• https://air.nvidia.com
SN5400
64x 400GbE (50G lanes)
• NPU: NVIDIA Spectrum-4
• Switching capacity: 25.6 Tbps
• Ports: 64x 400GbE, QSFP-DD, 12W typical; 2x 25GbE, SFP28, 2.5W typical
• Timing: SyncE; Stratum-3E oscillator; PTP 1588 profiles; PPS in/out; PTP master clock source by external SFP
• System CPU: x86, hexa-core Xeon 2.2GHz; RAM: 32GB DDR4 SDRAM; image storage: 160GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot swap
• Dimensions: H: 2U, 3.43'' (87mm); W: 16.8'' (428mm); D: 28.3'' (720mm)
• Airflow: 4 fans, N+1, hot swap, reverse
• Availability: ES: February 2024; GA: April 2024
[Panel diagram: network ports, OOB, USB Type A, serial console, swappable fans, AC PSUs]
SN5600
64x 800GbE (100G lanes)
• NPU: NVIDIA Spectrum-4
• Switching capacity: 51.2 Tbps
• Ports: 64x 800GbE, OSFP, 18W typical [or 2x 400GbE per port]; 1x 25GbE, SFP28, 2.5W typical
• System CPU: x86, hexa-core Xeon 2.2GHz; RAM: 32GB DDR4 SODIMM; image storage: 160GB SATA SSD
• System power: AC PSUs, 1+1 redundancy, hot swap
• Dimensions: H: 2U, 3.43'' (87mm); W: 16.8'' (428mm); D: 28.3'' (720mm)
• Airflow: 4 fans, N+1, hot swap, reverse
• Availability: ES: available; GA: December 2023
[Panel diagram: 64x 800GbE OSFP ports [or 2x 400GbE], OOB, USB Type A, serial console, swappable fans, AC PSUs]