S74435 - Empower Next-Generation AI With NVIDIA SuperPOD

Empowering Next-Generation AI with NVIDIA DGX SuperPOD and Pure Storage AI Solutions

Matt Taylor, VP Business Development HPC & AI, Pure Storage
Robert Alvarez, PhD, AI Solutions Architect, Pure Storage; former AI Solutions Architect, NASA
Agenda
• Pure Storage & NVIDIA Collaboration
• Challenges
• How Pure Storage is Different
• What Pure Storage Delivers
• Pure Storage with NVIDIA DGX SuperPOD
• Solving Real-World Problems

©2025 Pure Storage
Pure Storage and NVIDIA Collaboration
First enterprise AI storage collaboration (March 2018), the widest range of offerings, and global customers across all industries:
• NVIDIA DGX SuperPOD™ certification
• NVIDIA DGX BasePOD™ (AIRI) certification
• NVIDIA OVX™ validation
• High Performance Storage certification supporting NVIDIA Cloud Partners
• GenAI Pod
• RAG and NVIDIA NIM™ Reference Architectures


Key Challenges
Key Challenges for an AI Data Platform
• Maintaining Peak GPU Utilization: keeping GPUs at maximum utilization while dynamically managing diverse workloads.
• Scaling Performance: seamlessly scale capacity and performance non-disruptively, paying only for the capacity consumed.
• Maintaining Availability: resiliency with no planned downtime and high availability for critical services.
• Meeting Space and Power Constraints: maximize storage density at the lowest energy consumption to maximize use of data center capacity.


How Pure Storage is Different
Pure Storage delivers higher GPU utilization, driven by unique differentiation in native flash management, a modern SW architecture, and a focus on simplicity and management at scale:
• Leading multi-dimensional performance: simultaneous reads, writes, and metadata operations
• Data center efficiency: 25x* more space- and power-efficient; 10-25x more reliable than other SSD and HDD vendors
• Simplest architecture: consolidation of all tiers on one architecture; non-disruptive expansion of performance and capacity
• Flexible deployment: always-improving life cycle; non-disruptive upgrades forever, across generations; SLA-based consumption model
• Powering the most demanding AI: Meta, CoreWeave, SoftBank, and more

*Typical configurations


Leading Multi-Dimensional Performance
Outcomes in hours or days instead of months.
Any job | Any protocol | Any size | Any object count | Any processing type
Ingestion | Persistence | Processing | Training | Inference
• Predictable performance without checkpoint expansion complexity
• Scale performance granularly, on demand; no tuning needed
• Performance without deep expertise; intuitive and easy-to-use interface
• Industry-leading Watts/IOPS and Watts/GB/s; 85% less energy consumption
Performance dimensions: large and small files | high metadata performance | low latency | high bandwidth | random and sequential access
AI workloads require multi-dimensional performance.


Data Center Efficiency: Highest Reliability and Longest Lifetime
Annual return rates (%): 5x better than conventional SSDs, 7x better than nearline HDD.¹ ²
DirectFlash® Modules provide:
• Low failure rates, resulting in higher efficiency
• 10-25x the reliability of SSDs on a per-bit basis
• A failure rate that does not depend on capacity, which is critical to scaling
• A longer lifetime than HDDs and SSDs
• Pure1® telemetry that drives continued improvement through field visibility
DirectFlash Modules unlock capacity, performance, and reliability at scale.

¹ https://www.backblaze.com/blog/backblaze-drive-stats-for-2022/
² https://www.backblaze.com/blog/ssd-edition-2022-drive-stats-review/


Simplest and Most Flexible Architecture
An SDS-based approach starts with manageable complexity (1 data VIP, 1 management VIP, multiple uplinks to the customer network, and compute and storage nodes on an internal spine-and-leaf network), but complexity increases exponentially as the system scales: more components, more networking layers, more IPs, load balancing, and growing customer network port consumption.
The Pure Storage approach is always simple at any scale: 1 data VIP, 1 management VIP, and multiple uplinks to the customer network, whether the system is a single chassis (up to 6 PB), two chassis (up to 12 PB), three chassis (up to 18 PB), or beyond.
The difference: ease of scaling.


Flexible Deployment: Accelerate Time to Results

Storage-as-a-Service:
✓ Guaranteed SLA-based outcomes
✓ On-prem, public cloud, and hybrid cloud
✓ No hardware investment
✓ Onus on Pure to deliver the appropriate solution
✓ Operational costs included
Skip the sizing: Evergreen//One for AI is guaranteed to work and be cost-effective.

Hardware Purchase:
✓ On-prem and hybrid cloud
✓ Hardware ownership
✓ Subscription to innovation
✓ Buy it once, use it forever
✓ Upgrade in place, no data migrations


Powering the Most Demanding AI
Meta (Facebook) AI Research SuperCluster (https://ai.meta.com/blog/ai-rsc):
• 675 PB of usable storage, fronted by a cache tier, providing bulk storage for production model training
• Optimal capacity footprint: 250 TB/RU, 1.2 W/TB
• All-flash capacity tier: 10 GB/s/PB
• Denser and more power- and cost-efficient than HDD
• 15 PB of NFS storage (3 clusters): home directories and working space for AI researchers
• Optimal performance: 198 TB/RU, 2.25 W/TB
• High-throughput file & object: 20 GB/s/PB
• Well over half an exabyte of flash
• 5 exaflops of compute power
• All on a massive flat network




Pure Storage with NVIDIA DGX SuperPOD
What is an NVIDIA DGX SuperPOD?

You choose NVIDIA DGX BasePOD™ when you need a choice of flexible vs. performance-optimized designs, or inclusion of storage that is not DGX SuperPOD certified. You choose NVIDIA DGX SuperPOD™ when you need a replica of NVIDIA infrastructure, a turnkey deployment, and full-stack support of the entire deployment.

NVIDIA DGX BasePOD™ is a scalable, foundational architecture. You get:
• Flexible reference architectures
• Powered by Base Command
• Validated against key benchmarks
• A foundation for partner-branded offerings

NVIDIA DGX SuperPOD™ is a physical twin of NVIDIA's infrastructure. You get:
• A turnkey data center infrastructure product
• Powered by Base Command
• Certified performance for the most complex workloads
• No customization, no partner re-branding


Pure Storage with NVIDIA DGX SuperPOD

Storage-as-a-Service:
• Solve the unpredictability of AI with SLA-driven Storage-as-a-Service
• Guaranteed performance for any AI workload
• Non-disruptive HW and SW upgrades: no planned downtime

FlashBlade//S™:
• Scalable and flexible QLC-based storage on Ethernet
• Multi-dimensional performance powers enterprise training
• Reduce energy, space, and carbon emissions by up to 85%

NVIDIA DGX integrations:
• NVIDIA Magnum IO™ GPUDirect® Storage (GDS)
• Pure Aggregated Client (PAC)

Single control plane:
• Observability and manageability for capacity and performance
• Plan for future performance and capacity needs, optimize energy efficiency, and secure critical data


Pure Storage with NVIDIA DGX SuperPOD Network Topology
Transparent scaling of the DGX SuperPOD configuration:
• SN5600 switches uplink to the customer network with 16 x 400Gb links
• SN5600 connects to the XFM8400 FlashBlade network with 4 x 400Gb links
• Each FlashBlade//S chassis (1 through n) attaches with 4 x 100Gb links

Notes:
• Add capacity and/or performance with no disruption
• Multiple chassis use the same uplink and management configuration, allowing for transparent expansion to the NVIDIA network as the number of SUs increases


Why Ethernet?
• High-speed data transfer
• RDMA support
• Low-latency network
• Multi-tenant
• Low HW cost
• Flexibility and scalability
• Improved congestion control
• Multi-vendor
• Simplicity

In the Pure Storage with NVIDIA DGX SuperPOD deployment:
• All intra-GPU (east-west) traffic is handled via InfiniBand
• All north-south traffic is handled via Ethernet
• Scaling up to a 10-chassis FlashBlade (60 PB) is transparent to the NVIDIA network
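The figures above can be checked with back-of-the-envelope arithmetic. The sketch below uses only the link counts and per-chassis capacity stated on these slides; it computes raw link bandwidth, not delivered storage throughput, which will be lower in practice.

```python
# Illustrative arithmetic only: link counts and capacities come from the
# topology and architecture slides; delivered throughput is lower in practice.
GBIT_PER_LINK = 100       # each FlashBlade//S chassis uses 4 x 100GbE uplinks
LINKS_PER_CHASSIS = 4
PB_PER_CHASSIS = 6        # "single chassis: up to 6 PB"

def chassis_uplink_gb_per_s() -> float:
    """Raw uplink bandwidth per chassis, converting Gb/s to GB/s."""
    return GBIT_PER_LINK * LINKS_PER_CHASSIS / 8

def cluster_capacity_pb(n_chassis: int) -> int:
    """Maximum raw capacity for an n-chassis FlashBlade deployment."""
    return n_chassis * PB_PER_CHASSIS

print(chassis_uplink_gb_per_s())   # 50.0 GB/s of raw link bandwidth per chassis
print(cluster_capacity_pb(10))     # 60 PB, matching the 10-chassis figure above
```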


DGX SuperPOD Guidance for Storage Performance

Tiers: Good (text workloads) | Better (text/image workloads) | Best (4K video workloads), each specified in GB/s for:
• Single-node read
• Single-node write
• Single-SU aggregate system read
• Single-SU aggregate system write
• 4-SU aggregate system read
• 4-SU aggregate system write


Pure Storage with NVIDIA DGX SuperPOD Differentiation
Why you need Pure Storage FlashBlade with NVIDIA DGX SuperPOD:
• Multi-dimensional performance
• Non-disruptive software and hardware upgrades
• 65% less power consumption and 40% less rack space
• Consolidated data silos
• Evergreen//One solves the unpredictability of AI with SLA-driven Storage-as-a-Service
• Ethernet networking offers lower TCO, broad interoperability, and proven reliability


Solving Real-World Problems
Accelerating the AI Journey: From Data to Scale
• AI-ready (Data Engineering): time to data
• AI Exploration (Dev/Test): time to science
• AI-enabled (Products / Businesses): time to deploy
• At Scale (Large AI Models): time to scale

Pure Storage Platform: Storage-as-a-Service | Effortless Management at Any Scale | Simple Unified Infrastructure | Evergreen Architecture


AIReady: Powering AI Across the Full Stack
Drive simplicity and speed with a unified,
partner-driven stack for AI innovation
MLOps

Seamlessly integrate AI/ML pipelines,


including multimodal GenAI and analytics App Workflow .data
NVIDIA NIM
workloads
Foundational Models Llama 3

Enable scalable data management for Vector DBs KDB.AI


training, validation, inference, and audit
needs across the AI lifecycle Scheduling

Manage AI-related data efficiently— Distributed AI


training/validation datasets, inference
audits/monitoring, model artifacts, and GPU Infrastructure
semantic vectors, etc.
Storage
Optimize GPU and compute scheduling for
efficient, high-performance AI deployments. CHIP / Network Fabric Spectrum Quantum*

©2025 Pure Storage 24


Exploration: Pure Storage Full-Stack GenAI Pod
Frictionless customer experience: a single SKU to purchase; racked, stacked, and configured upon delivery; L1/L2 support across the full stack.

GenAI deployment stack: Management and Orchestration (incl. GPU) | GenAI building microservices (NVIDIA NIM, NREM) | Foundational Models (Llama 3) | Vector DBs | Kubernetes | Compute | Storage | Network (Arista)

• Minimize time to deploy a GenAI pipeline by 90% through one-click deployment
• Single pane of glass for the GenAI Pod (NIM models, vector DBs)
• Delivers a unified UX, driving efficiency across the whole GenAI stack on a per-token basis
• Built-in general RAG workflow templates, extensible for use across multiple verticals

Pure Storage GenAI Pod delivers an integrated HW+SW solution with one-click deployment of all GenAI SW components, with built-in RAG workflow templates.
AI-Enabled: Pure Storage and NVIDIA NIM
NVIDIA NIM: deploy anywhere and maintain control of generative AI applications and data.
• Simplified development of AI applications that can run in enterprise environments
• Prebuilt container and Helm chart
• Day-0 support for all generative AI models, providing choice across the ecosystem
• Industry-standard APIs
• Support for custom models and domain-specific code
• Optimized inference engines
• Improved TCO, with best latency and throughput running on accelerated infrastructure
• Best accuracy for the enterprise by enabling tuning with proprietary data sources
• Enterprise software with feature branches, validation, and support
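Because NIM exposes industry-standard (OpenAI-compatible) APIs, a deployed endpoint can be exercised with a few lines of standard-library Python. This is a minimal sketch only: the URL and model name below are assumptions for a hypothetical local deployment, not values from this deck.

```python
import json

# Hypothetical endpoint for a locally deployed NIM container; adjust to your host.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload of the kind NIM accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta/llama3-8b-instruct", "Summarize RAG in one sentence.")
print(json.dumps(payload, indent=2))

# To actually call the service (requires a running NIM container):
#   import urllib.request
#   req = urllib.request.Request(NIM_URL, data=json.dumps(payload).encode(),
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```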
At Scale: Speed Up Inferencing, Training, and Checkpointing
• Unlock faster insights: AI inferencing boosted by 40%¹
• Minimize downtime: model checkpointing up to 15% faster¹
• No idle GPUs: sustain peak AI performance with faster data access
• Pure Storage accelerates data transfer to GPUs, removing bottlenecks in model training and fine-tuning

¹ As compared to local disk
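Why checkpoint speed matters can be seen with simple arithmetic: the time GPUs stall during a synchronous checkpoint is roughly checkpoint size divided by write bandwidth. The model size and bandwidth figures below are illustrative assumptions, not numbers from this deck.

```python
def checkpoint_size_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GB (weights only, e.g. 2 bytes for bf16)."""
    return params_billions * bytes_per_param  # 1e9 params x N bytes = N GB

def checkpoint_stall_s(params_billions: float, write_gb_per_s: float,
                       bytes_per_param: int = 2) -> float:
    """Rough GPU stall time for a synchronous checkpoint write."""
    return checkpoint_size_gb(params_billions, bytes_per_param) / write_gb_per_s

# A hypothetical 70B-parameter model in bf16 is ~140 GB of weights; at an
# assumed 100 GB/s aggregate write bandwidth the stall is ~1.4 s, versus
# ~70 s at a local-disk-class 2 GB/s.
print(checkpoint_size_gb(70))       # 140 GB
print(checkpoint_stall_s(70, 100))  # 1.4 s
print(checkpoint_stall_s(70, 2))    # 70.0 s
```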
Unleashing RAG Performance
36% faster data ingestion.
Test topology: an NVIDIA DGX A100 (CX7 NICs) and a FlashBlade, each connected over 100GbE to a pair of NVIDIA SN3700 switches, running NVIDIA NeMo™ Retriever microservices (ingest microservice, vector DB, retriever microservice).
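The ingest/retrieve split above can be sketched end to end in plain Python. This is a toy stand-in: a bag-of-words "embedding" and an in-memory list replace NeMo Retriever's embedding microservice and a real vector database, and every name below is illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real pipeline calls an embedding microservice.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingest: embed documents and store them (vector-DB stand-in).
docs = [
    "flashblade provides high throughput shared storage",
    "bananas are a yellow fruit",
]
index = [(doc, embed(doc)) for doc in docs]

def retrieve(query: str, k: int = 1) -> list:
    """Retrieve: rank stored documents by similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("high throughput storage"))
```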


Summary
● Power end-to-end AI pipelines, enabling seamless data flow from ingestion to inferencing
● Reduce time to data, science, deployment, and scale, ensuring AI innovation at enterprise speed
● Optimize AI workloads with accelerated model training, fine-tuning, and checkpointing
● Unlock 36% faster ingestion times and linear scaling for Retrieval-Augmented Generation (RAG) deployments
● Eliminate GPU bottlenecks with high-speed data delivery and efficient storage for uninterrupted performance