NVIDIA DGX SuperPOD: VAST

Reference Architecture
Featuring NVIDIA DGX H100 and DGX A100 Systems

RA-11389-001 v1
June 2023
Abstract
The NVIDIA DGX SuperPOD™ with NVIDIA DGX™ H100 or DGX A100 systems is an
artificial intelligence (AI) supercomputing infrastructure, providing the computational
power necessary to train today's state-of-the-art deep learning (DL) models and to fuel
future innovation. The DGX SuperPOD delivers groundbreaking performance, deploys as
a fully integrated system, and is designed to solve the world's most challenging
computational problems.
This DGX SuperPOD reference architecture (RA) is the result of collaboration between
DL scientists, application performance engineers, and system architects to build a
system capable of supporting the widest range of DL workloads. The groundbreaking
performance delivered by the DGX SuperPOD with DGX systems enables the rapid
training of DL models at great scale. The integrated approach of provisioning,
management, compute, networking, and fast storage enables a diverse, multi-tenant
system that can span data analytics, model development, and AI inference.
The VAST Data Platform was evaluated for suitability for supporting DL workloads when
connected to the DGX SuperPOD. The VAST Data Platform is the first enterprise NAS
solution certified for DGX SuperPOD, providing the performance and scalability of
parallel file system based architectures, but with the simplicity and ease of use of an
enterprise appliance. This fully integrated, turnkey architecture is validated with the DGX
SuperPOD and the VAST Data Platform and is fully supported by VAST support services.
Regardless of previous HPC experience, DGX SuperPOD with the VAST Data Platform is
designed to make large-scale AI simpler, faster, and easier to manage for every
organization and their IT team. The VAST Data Platform as an enterprise NAS solution
has key capabilities that benefit organizations as they elevate their AI initiatives to DGX
SuperPOD scale:
> Exascale NFS that provides all the performance required for the most demanding
HPC and AI workloads without the complexity of parallel file system solutions.
> Seamlessly scale the infrastructure as your data sets and performance requirements
grow with zero downtime for upgrades or expansions.
> Exabyte-scale all NVMe namespace with archive tier economics that eliminates data
silos and tiers so that every I/O is served in real time.
> Native multi-protocol support enables NFS, SMB, and S3 access to all data without
gateways or add-ons.
Learn more about the NVIDIA / VAST partnership at: https://vastdata.com/nvidia.

NVIDIA DGX SuperPOD: VAST RA-11389-001 v1 | i

Contents
Storage Overview
  Storage Caching Hierarchy
  Storage Performance Requirements
About the VAST Data Platform
Validation Methodology
  Microbenchmarks
    Hero Benchmark Performance
    Single-Node, Multi-File Performance
    Multi-Node, Multi-File Performance
    Single-File I/O Performance
  Application Testing
    ResNet-50
    NLP—BERT
    Recommender—DLRM
Summary

Storage Overview

Training performance may be limited by the rate at which data can be read and reread
from storage. The key to performance is the ability to read data multiple times, ideally
from local storage. The closer the data is cached to the GPU, the faster it can be read.
Both persistent and nonpersistent storage need to be designed so as to balance the
needs of performance, capacity, and cost.

Storage Caching Hierarchy

The storage caching hierarchy for DGX systems is shown in Table 1. Depending on data
size and performance needs, each tier of the hierarchy can be leveraged to maximize
application performance.

Table 1. DGX system storage and caching hierarchy

Storage Hierarchy Level   Technology      Total Capacity (1)   Performance (1)
RAM                       DRAM            2 TB (2)             > 200 GiB/s
Internal storage          Flash storage   30 TB (3)            > 50 GiB/s

1. Total capacity and performance values are per system.
2. Shared between the operating system, application, and other system processes.
3. PCIe Gen 4 NVMe SSD storage.

Caching data in local RAM provides the best performance for reads. This caching is
transparent once the data is read from the filesystem. While local storage is fast, it is
not practical to manage a dynamic environment with local disk alone in a multi-node
environment. Functionally, centralized storage may be as quick as local storage on many
workloads.
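
As a rough illustration of how the tiers in Table 1 might guide data placement, the sketch below returns the fastest tier that can hold a dataset of a given size. The capacities are taken from Table 1. The function name and the `ram_usable_fraction` value are assumptions made for this example, since RAM is shared with the operating system and applications and the usable share is not stated here.

```python
# Per-system capacities from Table 1, in TB, ordered fastest tier first.
TIERS = [
    ("RAM", 2.0),
    ("Internal storage", 30.0),
]

def fastest_tier(dataset_tb, ram_usable_fraction=0.5):
    """Return the fastest tier that can hold the whole dataset.

    ram_usable_fraction is an assumed value: RAM is shared with the
    operating system and applications, so only part of it can cache data.
    """
    for name, capacity_tb in TIERS:
        usable_tb = capacity_tb * ram_usable_fraction if name == "RAM" else capacity_tb
        if dataset_tb <= usable_tb:
            return name
    # Anything larger than the local tiers has to be read from shared storage.
    return "Shared storage"

print(fastest_tier(0.1))   # RAM
print(fastest_tier(5))     # Internal storage
print(fastest_tier(100))   # Shared storage
```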

Storage Performance Requirements

Performance requirements for high-speed storage greatly depend on the types of AI
models and data formats used. The DGX SuperPOD has been designed as a capability-
class system that can manage any workload both today and in the future. However, if
systems are going to focus on a specific workload, such as natural language processing
(NLP), it may be possible to better estimate performance needs of the storage system.

To allow customers to characterize their own performance requirements, some
general guidance on common workloads and datasets is shown in Table 2.

Table 2. Characterizing different I/O workloads

Storage Performance   Example Workloads                  Dataset Size
Level Required
Good                  NLP                                Most to all datasets fit in cache
Better                Image processing with compressed   Many to most datasets can fit within the
                      images, ImageNet/ResNet-50         local system's cache
Best                  Training with 1080p, 4K, or        Datasets are too large to fit into cache,
                      uncompressed images, offline       massive first epoch I/O requirements,
                      inference, ETL                     workflows that only read the dataset once

Performance estimates for the storage system necessary to meet the guidelines in
Table 2 are in:
> Table 4 of the NVIDIA DGX SuperPOD Reference Architecture—DGX H100 Systems.
> Table 8 of the NVIDIA DGX SuperPOD Reference Architecture—DGX A100 Systems.
Achieving these performance characteristics may require the use of optimized file
formats such as TFRecord, RecordIO, or HDF5.
The high-speed storage provides a shared view of an organization’s data to all systems.
It needs to be optimized for small, random I/O patterns, and provide high peak system
performance and high aggregate filesystem performance to meet the variety of training
workloads an organization may encounter.

About the VAST Data Platform

The VAST Data Platform meets these performance characteristics with an architecture
that delivers not just speed and scale but also operational efficiency and ease of use,
so that any IT team can deploy and support large-scale AI initiatives with DGX SuperPOD.

Figure 1. Highly available, NVIDIA BlueField DPU integrated NVMe enclosure

VAST challenges the long-held assumption that NFS performance is inadequate for AI
and HPC workloads. The VAST Disaggregated, Shared-Everything architecture consists
of two building blocks that are scaled across a common NVMe fabric. First, the state and
storage capacity of the system is built from resilient, high-density NVMe-oF storage
enclosures. Second, the logic of the system is implemented by stateless containers that
each can connect to and manage all the media in the enclosures. By disaggregating
compute from storage, it is possible to spread I/O across the system and achieve the
high levels of parallelism needed for massive performance.
Other benefits of the VAST Data Platform for DGX SuperPOD include:
> Independent scaling of performance and capacity with support for mixed
generations of hardware in a single exabyte-scale namespace.
> Native support for both InfiniBand and Ethernet in the same namespace.
> Archive tier economics via next-generation global data reduction algorithms, support
for hyperscale QLC flash with ten years of endurance, and ultra-efficient locally
decodable erasure codes.
> Non-disruptive, online system expansions and software upgrades.
> Encryption, authentication, and external key management.
> VAST Catalog, a built-in metadata index, allows customers to find and manage data
via SQL queries.
> Enterprise-grade data protection with support for n-1 and 1-n replication topologies
and up to 1 million ransomware-proof snapshots.
> The VAST Data GUI management interface provides thousands of metrics via an
API-first architecture unlocking real-time visibility into performance metrics.

Validation Methodology

Three classes of validation tests are used to evaluate a particular storage technology
and its configuration for use with the DGX SuperPOD: microbenchmark performance,
real application performance, and functional testing. The microbenchmarks measure key
I/O patterns for DL training and are designed so they can be run on CPU-only nodes. This
reduces the need for large GPU-based systems to validate storage. Real DL training
applications are then run on a DGX SuperPOD to confirm that the applications meet
expected performance. Beyond performance, storage solutions are evaluated for
robustness and resiliency as part of functional testing.
The NVIDIA DGX SuperPOD storage validation process leverages a “Pass or Fail”
methodology. Specific targets are set for the microbenchmark test. Each benchmark
result is graded as good, fair, or poor. A passing grade is one where at least 80% of the
tests are good, and none are poor. In addition, there must be no catastrophic issues
created during testing. For application testing, a passing grade is one where all cases
complete within 5% of the roofline performance set by running the same tests with data
staged on the DGX RAID. For functional testing, a passing grade is one where all
functional tests meet their expected outcomes.
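
The microbenchmark pass rule described above can be sketched in a few lines. This is only an illustration of the stated rule (at least 80% of results good, none poor, no catastrophic issues). The function name and the input format are assumptions made for this example, not part of the NVIDIA validation tooling.

```python
def microbenchmark_passes(grades, catastrophic_issue=False):
    """Apply the stated pass rule to a list of per-benchmark grades.

    grades: list of "good", "fair", or "poor", one entry per benchmark result.
    """
    if not grades or catastrophic_issue:
        return False
    good = grades.count("good")
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    at_least_80_percent_good = good * 100 >= 80 * len(grades)
    return at_least_80_percent_good and "poor" not in grades

print(microbenchmark_passes(["good"] * 8 + ["fair"] * 2))  # True
print(microbenchmark_passes(["good"] * 9 + ["poor"]))      # False
```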

Microbenchmarks
In the Storage Performance Requirements section, there are several high-level
performance metrics that storage systems must meet to qualify as a DGX SuperPOD
solution. Current testing requires that the solutions meet the “Best” criteria discussed in
Table 2. In addition to these high-level metrics, several groups of tests are run to
validate the overall capabilities of the proposed solutions. These include single-node
tests where the number of threads is varied and multi-node tests where a single thread
count is used as the number of nodes varies. In addition, each test is run in both
Buffered and DirectIO modes, both when I/O is performed to separate files and when all
threads and nodes operate on the same file.
Four different read patterns are run. The first read operation is sequential where no data
is in the cache. The second read operation is executed immediately thereafter to
evaluate the ability of the filesystem to cache data. The cache is purged and then the
data is read again, this time randomly. Lastly, the data is reread randomly to
evaluate data caching.
The IOR benchmark was used for both the single-node and multi-node tests.
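
To make the size of the read test matrix concrete, the sketch below enumerates the combinations described above: two I/O modes, two file layouts, and four read passes. The dictionary keys and the `purge_cache_first` flag are names chosen for this example. They reflect one reading of the text, in which the cache is empty before the first sequential read and before the first random read. IOR command lines are deliberately not shown.

```python
from itertools import product

IO_MODES = ("Buffered", "DirectIO")
FILE_LAYOUTS = ("separate files", "single shared file")

# (pass name, access pattern, is the cache emptied before this pass?)
READ_PASSES = (
    ("read", "sequential", True),
    ("reread", "sequential", False),
    ("random read", "random", True),
    ("random reread", "random", False),
)

def read_test_matrix():
    """Enumerate every combination of I/O mode, file layout, and read pass."""
    return [
        {
            "io_mode": mode,
            "layout": layout,
            "pass": name,
            "access": access,
            "purge_cache_first": purge,
        }
        for mode, layout in product(IO_MODES, FILE_LAYOUTS)
        for name, access, purge in READ_PASSES
    ]

matrix = read_test_matrix()
print(len(matrix))  # 16
```

Thread counts, node counts, and I/O sizes multiply this matrix further, which is why the targets in the following tables are given per configuration.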

Hero Benchmark Performance
The hero benchmark helps establish the peak performance capability of the entire
solution. Storage parameters, such as filesystem settings, I/O size, and CPU
affinity, were tuned to achieve the best read and write performance. Storage devices
were expected to demonstrate that quoted performance was close to measured
performance. Other tests are crafted to demonstrate performance of real workloads.
The delivered solution for a single SU had to demonstrate over 20 GiB/s for writes and
65 GiB/s for reads. Ideally, the write performance should be at least 50% of the read
performance. However, some storage architectures have a different balance between
read and write performance, so this is only a guideline and read performance is more
important than write.
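
The hero benchmark thresholds can be expressed as a small check. This is a minimal sketch: the 20 GiB/s and 65 GiB/s figures and the 50% write-to-read guideline come from the text, while the function name, the returned keys, and the choice to treat the thresholds as inclusive are assumptions made for this example.

```python
HERO_MIN_WRITE_GIB_S = 20.0  # per SU, from the text
HERO_MIN_READ_GIB_S = 65.0   # per SU, from the text

def check_hero(write_gib_s, read_gib_s):
    """Compare measured hero results for one SU against the stated targets."""
    return {
        # Hard requirement (thresholds treated as inclusive, an assumption).
        "meets_minimums": (write_gib_s >= HERO_MIN_WRITE_GIB_S
                           and read_gib_s >= HERO_MIN_READ_GIB_S),
        # Guideline only: write should ideally be at least 50% of read.
        "write_ratio_guideline_met": write_gib_s >= 0.5 * read_gib_s,
    }

# Hypothetical measurements, for illustration only.
print(check_hero(25.0, 70.0))
# {'meets_minimums': True, 'write_ratio_guideline_met': False}
```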

Single-Node, Multi-File Performance

For single-node performance, I/O read and write performance is measured by varying the
number of threads in incremental steps. Each thread writes (and reads) to (and from) its
own file in the same directory.
For single-node performance tests, the number of threads is varied from 1 to the ideal
number of threads to maximize performance (typically more than half the cores, 64, but
no more than the total physical cores, 128). The I/O size is varied between 128 KiB and
1 MiB, and the tests are run with both Buffered I/O and Direct I/O.
The target performance for these tests is shown in Table 3.

Table 3. Single-node, multi-file performance targets

                                 Performance (MiB/s)
Thread  Buffered or  I/O size                              Random  Random
Count   DirectIO     (KiB)     Write   Read    Reread      Read    Reread
1       Buffered     128       512     1,024   1,536       256     1,536
1       Buffered     1024      800     3,072   4,608       768     1,024
1       Direct       1024      1,024   1,024   N/A         1,024   N/A

When maximizing single-node performance, the thread count may vary; however, it is
expected that performance does not drop significantly when additional threads are used
beyond the optimal thread count.

Target performance for single-node performance with multiple threads is in Table 4. The
optimal number of threads may vary for any storage configuration.

Table 4. Single-node, multi-threaded performance targets

                                 Performance (MiB/s)
Thread  Buffered or  I/O size                              Random  Random
Count   DirectIO     (KiB)     Write   Read    Reread      Read    Reread
Varies  Buffered     128       8,000   12,000  18,000      12,000  18,000
Varies  Direct       128       8,000   15,000  N/A         15,000  N/A
Varies  Buffered     1024      10,000  20,000  30,000      20,000  30,000
Varies  Direct       1024      10,000  20,000  N/A         20,000  N/A

Reread performance relative to read performance can vary substantially between
different storage solutions. The reread performance should be at least 50% of the read
performance for both sequential and random reads.
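
One way to use the Table 4 targets is to compare a set of measurements against them and list the shortfalls. The sketch below does this and also applies the 50% reread guideline. The target values are copied from Table 4. The function names, the metric keys, and the sample measurements are hypothetical and chosen for this example.

```python
# Targets from Table 4, in MiB/s. None marks an N/A entry.
TABLE4_TARGETS = {
    ("Buffered", 128): {"write": 8000, "read": 12000, "reread": 18000,
                        "random_read": 12000, "random_reread": 18000},
    ("Direct", 128): {"write": 8000, "read": 15000, "reread": None,
                      "random_read": 15000, "random_reread": None},
    ("Buffered", 1024): {"write": 10000, "read": 20000, "reread": 30000,
                         "random_read": 20000, "random_reread": 30000},
    ("Direct", 1024): {"write": 10000, "read": 20000, "reread": None,
                       "random_read": 20000, "random_reread": None},
}

def below_target(mode, io_kib, measured):
    """Return the metrics whose measured MiB/s is below the Table 4 target."""
    targets = TABLE4_TARGETS[(mode, io_kib)]
    return [
        metric
        for metric, target in targets.items()
        if target is not None and measured.get(metric, 0) < target
    ]

def reread_ratio_ok(read_mib_s, reread_mib_s):
    """Reread should be at least 50% of read, per the guideline in the text."""
    return reread_mib_s >= 0.5 * read_mib_s

# Hypothetical measurements, for illustration only.
sample = {"write": 9000, "read": 12500, "reread": 17000,
          "random_read": 12000, "random_reread": 19000}
print(below_target("Buffered", 128, sample))  # ['reread']
print(reread_ratio_ok(sample["read"], sample["reread"]))  # True
```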

Multi-Node, Multi-File Performance

The next test performed is a multi-node I/O read and write test to make sure that the
storage appliance can provide the minimum required buffered read and write per system
for the DGX SuperPOD. This benchmark determines the capacity of a filesystem to scale
performance of different I/O patterns. Performance should scale linearly from one to a
few nodes, reach a maximum performance, and not drop off significantly as more nodes
are added to the job.
The target performance for a single SU of 20 nodes is 65 GiB/s for reads, whether the
I/O size is 128 KiB or 1,024 KiB and whether the I/O is Direct or Buffered. The write
performance should be at least 20 GiB/s, but ideally it would be 50% of the read
performance. Results from
these tests must be interpreted carefully as it is possible to add more hardware to
achieve these levels. Overall performance is the goal, but it is desirable that the
performance comes from an efficient architecture that is not over-designed for its use.
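
The expected scaling behavior can be checked from a series of multi-node results. The sketch below tests two things: whether the result at the largest node count reaches the 65 GiB/s read target, and whether throughput at that node count stays close to the peak. The text does not quantify a "significant" drop, so the 10% `max_drop` tolerance is an assumption, as are the function name and the sample numbers.

```python
def scaling_report(read_gib_s_by_nodes, target_gib_s=65.0, max_drop=0.10):
    """Summarize a multi-node read scaling run.

    read_gib_s_by_nodes: dict mapping node count -> aggregate read GiB/s.
    max_drop: assumed tolerance for how far below the peak the result at
              the largest node count may fall.
    """
    node_counts = sorted(read_gib_s_by_nodes)
    peak = max(read_gib_s_by_nodes.values())
    at_full_scale = read_gib_s_by_nodes[node_counts[-1]]
    return {
        "meets_target": at_full_scale >= target_gib_s,
        "holds_at_scale": at_full_scale >= (1 - max_drop) * peak,
    }

# Hypothetical results, for illustration only.
print(scaling_report({1: 15.0, 2: 30.0, 4: 55.0, 10: 68.0, 20: 66.0}))
# {'meets_target': True, 'holds_at_scale': True}
```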

Single-File I/O Performance

A key I/O pattern is reading data from a single file. Often the fastest way to read data is
when all the data is organized into a single file, such as the RecordIO format, because
this eliminates the open and close operations required when data are organized into
multiple large files. Single-file reads are a key I/O pattern on DGX SuperPOD
configurations.
The targeted performance and expected I/O behavior are that single-node, multi-threaded
writes can successfully create the file, that sequential read and random read
performance is good, and that read performance scales as more nodes are used.
Multi-node, multi-threaded, single-file writes are not evaluated. In addition, it is
expected that buffered reread performance is similar to the multi-file reread performance.

Target performance for single file I/O is in Table 5.

Table 5. Single-file read performance targets

                                Performance (MiB/s)
Node   Buffered or  I/O size
Count  DirectIO     (KiB)     Read     Reread   Random Read  Random Reread
1      Buffered     128       2,500    (1)      2,500        (1)
1      Direct       128       15,000   N/A      15,000       N/A
1      Buffered     1024      3,000    (1)      3,000        (1)
1      Direct       1024      20,000   N/A      20,000       N/A
20     Buffered     128       65,000   (1)      65,000       (1)
20     Direct       128       65,000   N/A      65,000       N/A
20     Buffered     1024      65,000   (1)      65,000       (1)
20     Direct       1024      65,000   N/A      65,000       N/A

1. Reread performance of cached data should be near in performance to the results from the
multi-file reread test.

Application Testing
Microbenchmarks provide indications of the peak performance of key metrics. However,
it is application performance that is most important. A subset of the MLPerf Training
benchmarks is used to validate storage performance and function. Here, both single-
node and multi-node configurations are evaluated to ensure that the filesystem can
support different I/O patterns and workloads. Training performance when data is staged
on the DGX RAID was used as the baseline for performance. The performance goal is for
the total time to train when data is staged on the shared filesystem to be within 5% of
that measured when data is staged on the local RAID. This applies not just to individual
runs, but also when multiple cases are run across the DGX SuperPOD at the same time.
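
The 5% application criterion amounts to a relative slowdown check against the local RAID baseline. The sketch below is a minimal illustration under that reading. The function names and the timing values are hypothetical, and only a slowdown (not a speedup) is treated as a failure.

```python
def relative_slowdown(shared_fs_time, local_raid_time):
    """Fractional increase in time to train versus the local RAID baseline."""
    return (shared_fs_time - local_raid_time) / local_raid_time

def all_within_tolerance(runs, tolerance=0.05):
    """runs: list of (shared_fs_time, local_raid_time) pairs in the same unit.

    Every case must be within the tolerance, including cases run
    simultaneously across the system.
    """
    return all(relative_slowdown(shared, local) <= tolerance
               for shared, local in runs)

# Hypothetical times to train in minutes, for illustration only.
print(all_within_tolerance([(104.0, 100.0), (51.0, 50.0)]))  # True
print(all_within_tolerance([(104.0, 100.0), (56.0, 50.0)]))  # False
```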

ResNet-50
ResNet-50 is the canonical image classification benchmark. Its dataset size is over
100 GiB and it has a requirement for fast data ingestion. On a DGX system, single-node
training requires approximately 3 GiB/s and the dataset is small enough that it
can fit into cache. Preprocessing can vary, but the typical image size is approximately
128 KiB. One challenge of this benchmark is that at NVIDIA the processed images are
stored in the RecordIO format (i.e., one large file for the entire dataset) since this
provides the best performance for MLPerf. Since it is a single file, this can stress shared
filesystem architectures that do not distribute the data across multiple targets or
controllers.
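
A quick calculation shows what the ResNet-50 figures above imply. Using the approximate values from the text (3 GiB/s per node, 128 KiB per image, a dataset of about 100 GiB), the sketch below derives the equivalent image rate and a lower bound on the time for one uncached pass. The function names are chosen for this example, and because the inputs are approximations the results are order-of-magnitude estimates.

```python
KIB_PER_GIB = 1024 * 1024

def equivalent_images_per_second(throughput_gib_s=3.0, image_kib=128.0):
    """Image rate implied by a given read throughput and typical image size."""
    return throughput_gib_s * KIB_PER_GIB / image_kib

def min_seconds_for_one_pass(dataset_gib=100.0, throughput_gib_s=3.0):
    """Lower bound on the time to read the dataset once at full throughput."""
    return dataset_gib / throughput_gib_s

print(equivalent_images_per_second())         # 24576.0
print(round(min_seconds_for_one_pass(), 1))   # 33.3
```

Because the dataset is stored as a single RecordIO file, the storage system sees this load as large reads against one file rather than as thousands of small file opens per second.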

NLP—BERT
BERT is the reference standard NLP model. In this test, the system is filled with two
eight-node jobs and four single-node jobs (or fewer if not all 20 nodes of the SU are
available). It is expected that the total time to train is within 5% of that measured when
training from the local RAID. This test does not stress the filesystem but does ensure
that local caching is operating as needed.

Recommender—DLRM
The recommender model has different training characteristics than ResNet-50 and
BERT in that the model trains in less than a single epoch. This means that the data set is
read no more than once, and local caching of data cannot be used. To achieve full
training performance, DLRM must be able to read data at over 6 GiB/s. In addition, the
file reader uses DirectIO, which stresses the filesystem differently than the other two benchmarks.
The data are formatted into a single file.
This test is only run as a single-node test; however, several tests are run where the
number of simultaneous jobs varies from one to the total number of nodes available. It is
expected that the shared filesystem only sustains performance up to the peak
performance measured from the hero test. For 20 simultaneous cases, the storage
system would have to provide over 120 GiB/s of sustained read performance, more
than what is prescribed in the Storage Performance Requirements section. Even the
best performance outlined in that section is not meant to support every possible workload.
It is meant to provide a balance of high throughput while not over-architecting the
system.
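
The arithmetic behind the 120 GiB/s figure is simple to reproduce. The sketch below multiplies the per-job DLRM requirement by the number of simultaneous jobs and estimates how many jobs the 65 GiB/s single-SU read target could feed at full speed. The function names are chosen for this example. Since each job needs "over" 6 GiB/s, the job count is an upper bound.

```python
def aggregate_demand_gib_s(jobs, per_job_gib_s=6):
    """Total read throughput needed for a number of simultaneous DLRM jobs."""
    return jobs * per_job_gib_s

def max_full_speed_jobs(storage_read_gib_s=65, per_job_gib_s=6):
    """Upper bound on simultaneous jobs that can read at full speed."""
    return storage_read_gib_s // per_job_gib_s

print(aggregate_demand_gib_s(20))  # 120
print(max_full_speed_jobs())       # 10
```

With 20 single-node jobs, roughly half of them would saturate a storage system sized to the 65 GiB/s target, which is why sustained performance is only expected up to the hero benchmark peak.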

Summary

The VAST Data Platform meets the DGX SuperPOD performance and functionality
requirements. As an enterprise NAS solution, the VAST Data Platform paired with DGX
SuperPOD enables customers to take on demanding AI workloads without the
complexity and specialized skills typically associated with HPC storage solutions.
As requirements grow, the VAST Data Platform may be seamlessly scaled by adding
compute and/or storage resources tailored to meet performance and capacity targets.
The VAST Data Platform supports multiple generations of infrastructure in a cluster,
allowing customers to mitigate supply chain issues and select from a range of VAST
certified hardware solutions.
DGX SuperPOD customers can be confident that the VAST Data Platform will meet their
most challenging AI workloads at any scale.

Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a
product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the
information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the
consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document
is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time
without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise
agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects
to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual
obligations are formed either directly or indirectly by this document.
NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at
customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of
each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information
contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the
application to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the
NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no
liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner
that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this
document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products
or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other
intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full
compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS
(TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT,
MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE
FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES,
HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s
aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the
product.

Trademarks
NVIDIA, the NVIDIA logo, NVIDIA DGX POD, NVIDIA DGX SuperPOD are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and
other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051

http://www.nvidia.com

No ratings yet
06 GB0-392 Exam Syllabus For The H3CSE-RS-NSO
4 pages
6643690f56a51719abfa0901 - Gartner Market Guide For NDR
No ratings yet
6643690f56a51719abfa0901 - Gartner Market Guide For NDR
18 pages
01 Training Outline For The H3CSE-RS-SW Advanced Routing - Switching Technology 1
No ratings yet
01 Training Outline For The H3CSE-RS-SW Advanced Routing - Switching Technology 1
4 pages
A00009581enw Hyperconverged Infrastructure For Data Protection PDF
No ratings yet
A00009581enw Hyperconverged Infrastructure For Data Protection PDF
30 pages
04-Layer 2-LAN Switching Configuration Guide-Book
No ratings yet
04-Layer 2-LAN Switching Configuration Guide-Book
221 pages
06-Layer 2-WAN Access Configuration Guide-Book
No ratings yet
06-Layer 2-WAN Access Configuration Guide-Book
103 pages
HCI Goes Main Stream
No ratings yet
HCI Goes Main Stream
6 pages
The Forrester Wave™ Agile Content Management Systems (CMSes), Q1 2021-1
No ratings yet
The Forrester Wave™ Agile Content Management Systems (CMSes), Q1 2021-1
18 pages
Network Detection & Response Guide
No ratings yet
Network Detection & Response Guide
16 pages
How To Write An Effective RFP For B2B E-Commerce
No ratings yet
How To Write An Effective RFP For B2B E-Commerce
15 pages
Gorilla Guide To Hyperconverged Infrastructure For Tier 1dedicated Apps-A00015105enw
No ratings yet
Gorilla Guide To Hyperconverged Infrastructure For Tier 1dedicated Apps-A00015105enw
19 pages
The Gorilla Guide To HCI - Technical Overview-A00078608enw
No ratings yet
The Gorilla Guide To HCI - Technical Overview-A00078608enw
81 pages
Gorilla Guide To Hyperconverged Infrastructure For Cloud-A00009575enw
No ratings yet
Gorilla Guide To Hyperconverged Infrastructure For Cloud-A00009575enw
26 pages
C7000 BladeSystem EOSL
No ratings yet
C7000 BladeSystem EOSL
13 pages
B2B E-Commerce RFP Template by Liferay
No ratings yet
B2B E-Commerce RFP Template by Liferay
30 pages
Ant Media Server Enterprise and Community
No ratings yet
Ant Media Server Enterprise and Community
5 pages
En - Maevex 6100 Datasheet
No ratings yet
En - Maevex 6100 Datasheet
4 pages
6.2 FAQ22 CCC Checklist For Security Risk Assessment and Audit
No ratings yet
6.2 FAQ22 CCC Checklist For Security Risk Assessment and Audit
2 pages
En Maevex 6152 Encoder Datasheet
No ratings yet
En Maevex 6152 Encoder Datasheet
4 pages
7.0 - Contact Us
No ratings yet
7.0 - Contact Us
2 pages
5.0 - Fund Contribution
No ratings yet
5.0 - Fund Contribution
2 pages
3.3 - Database-as-a-Service (DBaaS)
No ratings yet
3.3 - Database-as-a-Service (DBaaS)
2 pages
GCIS IaaS: Cloud Services Overview
No ratings yet
GCIS IaaS: Cloud Services Overview
2 pages
6.1 FAQ21 CCC Checklist For Load Test
No ratings yet
6.1 FAQ21 CCC Checklist For Load Test
2 pages
GCIS Overview and Service Guidelines
No ratings yet
GCIS Overview and Service Guidelines
27 pages
3.0 Services Information
No ratings yet
3.0 Services Information
2 pages
4.0 - Getting Started
No ratings yet
4.0 - Getting Started
3 pages
DVF 5000 2nd Gen 5-Axis Machining Center
100% (1)
DVF 5000 2nd Gen 5-Axis Machining Center
24 pages
PPT ch14
No ratings yet
PPT ch14
62 pages
Digital Color Ultrasonography Specs
No ratings yet
Digital Color Ultrasonography Specs
2 pages
Iso 12647 4 2005
No ratings yet
Iso 12647 4 2005
11 pages
En - STM32WL Peripheral LoRaStack - LORASTACK
No ratings yet
En - STM32WL Peripheral LoRaStack - LORASTACK
21 pages
ThinkSystem DS Series SSF - DS6200 PDF
No ratings yet
ThinkSystem DS Series SSF - DS6200 PDF
32 pages
Virtual HIL Device Quick Start Guide
No ratings yet
Virtual HIL Device Quick Start Guide
6 pages
Event Registration: Minor Project Report On
No ratings yet
Event Registration: Minor Project Report On
37 pages
Certificate in MS-Office (4-Months)
No ratings yet
Certificate in MS-Office (4-Months)
27 pages
HarmonicMediaGrid FSD 4.0 UserGuide
No ratings yet
HarmonicMediaGrid FSD 4.0 UserGuide
52 pages
JSS 3 ICT Examination Paper
No ratings yet
JSS 3 ICT Examination Paper
9 pages
Syllabus-Big Data Visulaization
No ratings yet
Syllabus-Big Data Visulaization
2 pages
ICT Exp 02
No ratings yet
ICT Exp 02
3 pages
WTC Software User Manual en
No ratings yet
WTC Software User Manual en
40 pages
Mastering Microsoft Word - From Beginner To Expert
100% (2)
Mastering Microsoft Word - From Beginner To Expert
58 pages
21st Century Literature q2 Mod 3 Creative Literary Adaptation v2
No ratings yet
21st Century Literature q2 Mod 3 Creative Literary Adaptation v2
36 pages
8051 Microcontroller Instruction Set Guide
No ratings yet
8051 Microcontroller Instruction Set Guide
76 pages
Labview Application Builder User Guide: Instruments Software License Agreement Located On The Labview
No ratings yet
Labview Application Builder User Guide: Instruments Software License Agreement Located On The Labview
10 pages
Top Reasons To Buy The Xerox EX C60/C70 Print Server v2.0
No ratings yet
Top Reasons To Buy The Xerox EX C60/C70 Print Server v2.0
6 pages
Advanced Power Builder
No ratings yet
Advanced Power Builder
266 pages
Fidele SlidesCarnival
No ratings yet
Fidele SlidesCarnival
29 pages
Python Programming - The Crash Course For Python Learn The Secrets of Machine Learning, Data Science Analysis and Artificial Intelligence. Introduction To Deep Learning For Beginners (2
No ratings yet
Python Programming - The Crash Course For Python Learn The Secrets of Machine Learning, Data Science Analysis and Artificial Intelligence. Introduction To Deep Learning For Beginners (2
78 pages
DX Diag
No ratings yet
DX Diag
41 pages
(700GB+) Digital Products
No ratings yet
(700GB+) Digital Products
19 pages
Cse304 Computer Graphics and Visualization
No ratings yet
Cse304 Computer Graphics and Visualization
1 page
Hyper-V Cmdlets in Windows PowerShell PDF
No ratings yet
Hyper-V Cmdlets in Windows PowerShell PDF
246 pages
SNOW Basic Q&A
No ratings yet
SNOW Basic Q&A
6 pages
Elitehubs Quotation-2
No ratings yet
Elitehubs Quotation-2
3 pages
Advanced Digital System Design Using SoC FPGAs
100% (1)
Advanced Digital System Design Using SoC FPGAs
435 pages
IRemovaLPro Guide - Fixes
No ratings yet
IRemovaLPro Guide - Fixes
4 pages

NVIDIA DGX SuperPOD: VAST

NVIDIA DGX SuperPOD: VAST RA-11389-001 v1 | i

NVIDIA DGX SuperPOD: VAST Reference Architecture RA-11389-001 v1 | ii

Storage Caching Hierarchy

Table 1. DGX system storage and caching hierarchy

Storage Performance Requirements

Table 2. Characterizing different I/O workloads

Storage | Example Workloads | Dataset Size
Good | NLP | Most to all datasets fit in cache

Figure 1. Highly available, NVIDIA BlueField DPU integrated NVMe enclosure

Single-Node, Multi-File Performance

Table 3. Single-node, multi-file performance targets

Table 4. Single-node, multi-threaded performance targets

Reread performance relative to read performance can vary substantially between

Multi-Node, Multi-File Performance

Single-File I/O Performance

Table 5. Single-file read performance targets

NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051
