S42302
BUILDING A DATA CENTER DIGITAL TWIN
WITH NVIDIA OMNIVERSE AND NVIDIA AIR
AMIT KATZ AND MARC HAMILTON
THE CHALLENGE OF AI INFRASTRUCTURE
Enterprise AI requires time, expertise and the right approach to architecture
DESIGN COMPLEXITY: Ensuring predictable performance that scales
LENGTHY DEPLOYMENT: Procuring, installing, integrating multiple technologies
ON-GOING OPERATIONS: Training and ramping day-to-day production
DATA CENTER DESIGN IS AN EXTREMELY COMPLEX TEAM SPORT
TODAY NO TOOL SPANS COMPONENT TO FULL DATA CENTER
LARGE TEAMS WITH DIVERSE SKILLS
Component design (heatsink, etc.), server design, rack layout, enclosure layout, building, CFD, cooling, power, etc.
MANY VENDORS, MANY TOOLS
Design teams are plagued by often-incompatible tools, which causes tedious import-export, mistakes, and lost time. Design artifacts aren't re-used in operations.
RISE OF IMPORTANCE OF CFD
Ever-rising power densities and a new focus on energy efficiency bring new design challenges to the data center.
COMPLEX 8-RAIL CABLE LAYOUT
CAMBRIDGE-1 SIMULATION OF UNDERFLOOR PRESSURE
SIMULATION OF TEMPERATURE 4 FEET ABOVE RAISED FLOOR
60s – Temperature Plot 4 ft. Above Raised Floor
Mean Temperature: 95.1°F (excluding cold aisle)
Max Temperature: 110.1°F
NVIDIA DGX SUPERPOD CFD SIMULATION
CAMBRIDGE-1 IN OMNIVERSE
NVIDIA OMNIVERSE ENTERPRISE
[Diagram: NVIDIA Modulus architecture]
Inputs: Simulation Data | Observation Data | Model Layer Templates | Initial & Boundary Conditions | Differential Equations | Parameterized Geometry
Differential Equations:
$\rho \frac{Dh}{Dt} = \frac{Dp}{Dt} + \nabla \cdot (k \nabla T) + \Phi$
$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho u) = 0$
MODULUS: Data Preparation Module | Neural Physics Model Compiler | Neural Physics Training Engine
TensorFlow/PyTorch | NVIDIA AI stack
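As a rough illustration of how a differential equation can serve as a training constraint, the sketch below evaluates the residual of the continuity equation from this slide in one dimension using central differences. It does not use the Modulus API. The test field, the function names, and the sample points are all assumptions made for illustration. A physics-informed training loop would minimize a quantity like `mean_squared_residual` over a network's outputs.

```python
import math

# Hypothetical 1D test field (not from the slides): a density wave carried at
# constant velocity U, which satisfies d(rho)/dt + d(rho*u)/dx = 0 exactly.
U = 2.0

def rho(x, t):
    return 1.0 + 0.1 * math.sin(x - U * t)

def u(x, t):
    return U

def continuity_residual(rho_fn, u_fn, x, t, h=1e-4):
    """Central-difference estimate of d(rho)/dt + d(rho*u)/dx at (x, t)."""
    drho_dt = (rho_fn(x, t + h) - rho_fn(x, t - h)) / (2 * h)

    def flux(xx):
        return rho_fn(xx, t) * u_fn(xx, t)

    dflux_dx = (flux(x + h) - flux(x - h)) / (2 * h)
    return drho_dt + dflux_dx

def mean_squared_residual(rho_fn, u_fn, points):
    """Average squared residual over sample points; a physics-informed
    loss would drive this toward zero."""
    total = sum(continuity_residual(rho_fn, u_fn, x, t) ** 2 for x, t in points)
    return total / len(points)

# Assumed sample grid in (x, t).
points = [(0.1 * i, 0.05 * j) for i in range(10) for j in range(10)]

# Field that conserves mass: residual is near zero.
good = mean_squared_residual(rho, u, points)

# Field that violates mass conservation: same density profile, frozen in time.
bad = mean_squared_residual(lambda x, t: rho(x, 0.0), u, points)

print(good, bad)
```

The conserving field gives a residual close to zero, while the frozen field gives a clearly nonzero one, which is the signal such a loss would penalize.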
NVIDIA MODULUS - NVSWITCH HEAT SINK
Validation/Verification CFD Solvers vs Modulus/SimNet
PARAMETERIZED DGX-A100 NVSWITCH HEAT SINK
10,000x faster using Modulus
Computational Times (10 parameters, 3 values per parameter)
Modulus (Training Time): 10,800 V100 GPU hrs.
Traditional Solver (OpenFOAM): 18.4 million CPU hrs.
(59,049 separate runs, 26 wall hours each on 12 CPU cores)
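The traditional-solver figure can be checked with a few lines of arithmetic. This sketch assumes the 26 wall hours apply to each run and that every run occupies all 12 cores for that time. The variable names are illustrative.

```python
# Full-factorial sweep: every combination of 3 values across 10 parameters.
params = 10
values_per_param = 3
runs = values_per_param ** params

# Assumption: each run takes 26 wall hours on 12 CPU cores.
wall_hours_per_run = 26
cores_per_run = 12
cpu_hours = runs * wall_hours_per_run * cores_per_run

print(f"{runs:,} runs")            # 59,049 runs
print(f"{cpu_hours:,} CPU hours")  # 18,423,288 CPU hours, about 18.4 million
```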
NVIDIA BASE COMMAND
Connecting Real World and Digital Twin
Proven solutions already used within NVIDIA
• Base Command Manager
• Part of every new DGX SuperPOD deployment
• Infrastructure management for IT
• Scheduling, resource utilization, analytics, etc.
Features:
• Provisioning & lifecycle management
• Monitoring & Telemetry
• Logging & Alerting
• SLURM scheduler
[Diagram: IT/DevOps; Dashboard / Analytics | Infrastructure Monitoring | Resource allocation; Base Command Manager (Deployment | Provisioning | Monitoring | Logging & Alerting | SLURM); DGX SuperPOD]
NVIDIA AIR
OMNIVERSE SIMULATES THE WORLD
• Architecture, Engineering, and Construction
• Media, Entertainment, and Game Development
• Product Design and Manufacturing
• Scientific Visualization
• Robotics
• Autonomous Vehicles
DATA CENTER DIGITAL TWINS
Food for thought
Foster + Partners simulates the bridge before pouring concrete
Daimler validates the AV software stack before driving a mile
Why would a data center be different?
DATA CENTER DIGITAL TWINS
Food for thought
Structural Models
AV Models
Data Center Models
THE JOURNEY TO A SIMULATED DATA CENTER
Network Operations Have Changed
1990: CLI commands in the production environment
2000: Test-to-production pipelines
2020: Virtual to Physical networks from NVIDIA Air
Tomorrow: Digital Twin for E2E DC Validation
The Future: DC Recommendation Engine
DATA CENTER SIMULATION
Leveraging the combined power of Omniverse and NVIDIA Air
Architecture, Engineering, and Construction
Data Center: Space, power, cooling and cabling
Courtesy of WPP
Data Center Facility Design | Data Center Network Design | Change Management
NVIDIA AIR
Platform for Simulating and Emulating Data Center Infrastructure
[Diagram: NVIDIA AIR PLATFORM architecture]
Users: DevOps | Engineering | AI Operator
Access: API | WebUI
Inbound Connectivity | Outbound Connectivity
Digital Twin: Virtualized Network of NOS VMs and VMs
Hardware in the Loop: SPECTRUM (FW, ASIC)
NVIDIA OFFERINGS (Internal APIs): BLUEFIELD | EGX | DGX
CERTIFIED PARTNER OFFERINGS (Gateway, 3rd Party APIs): FIREWALL | SDN | STORAGE
Real Compute, Network, Storage Endpoints
DEPLOYMENT LIFE CYCLE
From POC to Decommission
DAY 0: PLAN | DAY 1: BUILD | DAY 2: MAINTAIN
[Diagram: DevOps change workflow between the Digital Twin and the Physical DC (Physical Twin)]
1. Apply change
2. Validate change
3. Deploy change (CI/CD, export configuration)
Use cases:
• Training & Education
• Presales PoC Labs
• Interop Testing
• Automation Development
• Provisioning Process Development
• Change Validation
• Troubleshooting Assistance
• New Personnel Onboarding
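The apply, validate, deploy loop on this slide can be sketched as a small gate in front of the physical network. This is a minimal illustration and does not use the NVIDIA Air API. The `Network` class, the function names, and the MTU check are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Network:
    """Hypothetical stand-in for a network's configuration state."""
    name: str
    config: dict = field(default_factory=dict)

def apply_change(net, change):
    """Step 1: return a copy of the network with the change applied."""
    return Network(net.name, {**net.config, **change})

def validate(net, checks):
    """Step 2: run every check and return the names of those that fail."""
    return [name for name, check in checks.items() if not check(net.config)]

def deploy(physical, twin):
    """Step 3: export the validated twin configuration to the physical network."""
    physical.config = dict(twin.config)

def change_pipeline(twin, physical, change, checks):
    """Apply a change to the twin first; touch the physical side only if it passes."""
    candidate = apply_change(twin, change)
    failures = validate(candidate, checks)
    if failures:
        return twin, failures  # physical network left untouched
    deploy(physical, candidate)
    return candidate, []

# Usage with assumed example values.
physical = Network("dc-physical", {"mtu": 1500, "bgp_asn": 65001})
twin = Network("dc-twin", dict(physical.config))
checks = {"mtu_in_range": lambda c: 1280 <= c["mtu"] <= 9216}

twin, failures = change_pipeline(twin, physical, {"mtu": 9000}, checks)
_, rejected = change_pipeline(twin, physical, {"mtu": 100}, checks)

print(physical.config, failures, rejected)
```

The design choice shown here is that a failed validation returns the unchanged twin, so a bad change never reaches the physical configuration.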
NVIDIA AIR
Solving the Hardest Challenges Through Cloud Agility With On-Prem Economics
END TO END SIMULATION: Full stack workflow modeling
HIGH FIDELITY: Real world hardware validation
OPEN ECOSYSTEM: Testing software stack interoperability
PUBLIC CLOUD PLATFORM: Availability and accessibility of test tools
SEE YOU IN THE SIMULATION
EXPLORE OMNIVERSE ENTERPRISE | GET ACCESS TO A FREE OMNIVERSE TRIAL | DEVELOP ON OMNIVERSE
EXPLORE NVIDIA AIR | JUMP START YOUR NETWORK AUTOMATION | DIVE INTO THE NVIDIA AIR DOCS