HABANA® GAUDI®2
WHITE PAPER
MAY 2022
2020 Habana Labs (an Intel Company) | www.habana.ai | Ver 1.1 | May 2022
I. INTRODUCTION
Gaudi®2 is Habana’s second-generation deep learning accelerator, supporting both training and inference. Building on the architecture of Gaudi, which launched first in the AWS EC2 cloud in the DL1 instance and on-premises via the Supermicro X12 servers, Gaudi2 brings a new level of performance and efficiency to deep learning in the datacenter and cloud.
The main benefit that current customers of our first-generation Gaudi see is the price-performance advantage relative to GPU solutions for popular vision and language models. This enables customers to train more and pay less, thereby accelerating time-to-market for their model training.
With Gaudi2, we are pleased to extend these benefits beyond price-performance to performance leadership versus the leading 7nm GPU shipping today, the A100.
Before we go into the architecture details, here are some key benchmarks for Gaudi2 at the time of publication of this white paper:
[Figure: BERT pre-training throughput comparison – Gaudi2 vs. A100 and V100; measurement configurations below]
A100-80GB: Measured by Habana, Jan 2022, on Azure instance Standard_ND96amsr_A100_v4 using a single A100-80GB with TF docker 21.02-tf2-py3 from NGC (Phase-1: seq len=128, BS=312, accu steps=1024; Phase-2: seq len=512, BS=40, accu steps=3072).
A100-40GB: Measured by Habana, Jan 2022, on DGX-A100 using a single A100-40GB with TF docker 21.12-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048).
V100-32GB: Measured by Habana, Jan 2022, on p3dn.24xlarge using a single V100-32GB with TF docker 21.12-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=8, accu steps=4096).
Gaudi2: Measured by Habana, Apr 2022, on a Gaudi2-HLS system using a single Gaudi2 with SynapseAI TF docker 1.4.0-435 (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048).
Results may vary.
Habana reported these initial results on the first models ported from Gaudi to Gaudi2. While our OEM partners are building servers for general availability, Habana’s engineers are currently porting and developing additional deep-learning models, with software releases on a 6-8 week cadence. You can track our progress through our public GitHub and the developer.habana.ai site.
The compute architecture is heterogeneous and includes two compute engines – a Matrix Multiplication Engine (MME) and a fully programmable Tensor Processor Core (TPC) cluster. The MME is responsible for all operations that can be lowered to matrix multiplication (fully connected layers, convolutions, batched-GEMM), while the TPC, a VLIW SIMD processor tailor-made for deep learning operations, is used to accelerate everything else. Besides the MME and TPC, Gaudi2 also instantiates several DMA engines, coupled with a transpose engine for efficient, on-the-fly tensor shape transformations, and able to read and write non-contiguous multi-dimensional tensors from and to the Gaudi2 memory subsystem.
The Gaudi2 processor offers 2.4 Terabits per second of networking bandwidth with the native on-chip integration of 24 x 100 Gbps RoCE v2 RDMA NICs, which enable inter-Gaudi communication via direct routing or via standard Ethernet switching.
The Gaudi2 memory subsystem includes 96 GB of HBM2E memory delivering 2.45 TB/sec of bandwidth, in addition to 48 MB of local SRAM with sufficient bandwidth to allow the MME, TPC, DMAs and RDMA NICs to operate in parallel.
Specifically for vision applications, Gaudi2 has integrated media decoders which operate independently and can handle the entire pre-processing pipe in all popular formats – HEVC, H.264, VP9 and JPEG – as well as the post-decode image transformations needed to prepare the data for the AI pipeline.
Gaudi2 supports all popular data types required for deep learning: FP32, TF32, BF16, FP16 and FP8 (both E4M3 and E5M2). In the MME, all data types are accumulated into an FP32 accumulator.
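To illustrate the trade-off between the two FP8 variants, here is a minimal sketch that computes the largest and smallest positive normal values of a generic floating-point format. It assumes the common convention of bias = 2^(E-1) - 1 with the top exponent code reserved for Inf/NaN; actual FP8 encodings vary by vendor (E4M3 in particular commonly reclaims most of the top exponent code, raising its maximum from 240 to 448), so treat the numbers as indicative only.

def fp_range(exp_bits: int, man_bits: int):
    # Largest / smallest positive normal values, assuming an IEEE-style
    # bias of 2**(exp_bits-1) - 1 and one exponent code reserved for Inf/NaN.
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias     # top exponent code reserved
    max_mantissa = 2 - 2 ** (-man_bits)      # 1.111...b
    return max_mantissa * 2 ** max_exp, 2.0 ** (1 - bias)

for name, e, m in [("E4M3", 4, 3), ("E5M2", 5, 2)]:
    largest, smallest = fp_range(e, m)
    print(f"{name}: max ~ {largest}, min normal ~ {smallest}")

Run as written, this shows a ~240 maximum (448 with reclaimed codes) for E4M3 against ~57344 for E5M2: E4M3 spends its bits on precision, E5M2 on dynamic range, which is why Gaudi2 offers both.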
The diagram below illustrates what is usually observed on GPUs, where the GEMM compute and the general-purpose cores’ execution time do not overlap:
[Figure: GEMM and non-GEMM operations executing serially on a GPU]
On the Gaudi and Gaudi2 architectures, the MME and TPC compute times overlap, such that the GEMM and non-GEMM operations mostly run concurrently, dramatically accelerating the workload.
[Figure: GEMM and non-GEMM operations overlapping on Gaudi and Gaudi2]
Another big difference between the GPU and Gaudi architectures is the size of the matrix multiplication accelerator. This fact, which may seem minor, has a big effect on the overall ability to utilize those accelerators, specifically when matrix sizes become smaller.
The diagram below compares one 256x256 matrix accelerator (on the left) to 256 small 16x16 matrix accelerators (on the right); the depth dimension was removed to simplify the explanation. From a compute perspective the two are equivalent, but from a bandwidth perspective, while the left one requires 512 input elements per cycle to utilize the compute, the right side requires 8K input elements per cycle to utilize the compute: a 16x difference in the read-bandwidth requirements toward the first-level memory.
[Figure: a single 256x256 MAC array consuming 256 elements per cycle per operand, vs. 256 independent 16x16 MAC arrays consuming 256x16 elements per cycle per operand]
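The arithmetic behind the 16x figure can be checked in a few lines; the sketch below assumes each n x n MAC array consumes one n-element row and one n-element column of operands per cycle:

def mac_array_demands(n: int, tiles: int = 1):
    # Compute and input-bandwidth demands of `tiles` independent n x n
    # MAC arrays, each reading one row and one column of operands per cycle.
    macs_per_cycle = tiles * n * n      # multiply-accumulates per cycle
    inputs_per_cycle = tiles * 2 * n    # operand elements per cycle
    return macs_per_cycle, inputs_per_cycle

big = mac_array_demands(256)               # one 256x256 array
small = mac_array_demands(16, tiles=256)   # 256 independent 16x16 arrays
print(big)     # (65536, 512):  64K MACs for 512 elements/cycle
print(small)   # (65536, 8192): the same 64K MACs for 8K elements/cycle
print(small[1] // big[1])   # 16 -> the read-bandwidth gap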
GPUs, which implement the right-side approach, compensate for this phenomenon by mandating a large reuse factor from the hierarchical caching memory subsystem, such that, overall, they require a very big matrix multiplication problem in order to utilize their multipliers. Gaudi2 (and some other dedicated tensor processors), which implement the left-side approach, can utilize their multipliers easily while leaving a lot of free bandwidth and capacity in their flat memory subsystem for tasks other than matrix multiplication.
Such high utilization on small tensors significantly eases the overlapping of MME and TPC computation: to allow such tight overlapping, an operation needs to be sliced, which creates smaller tensors for the MME and TPC to operate upon, as sketched below.
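As a conceptual illustration of this slicing (not Habana’s actual scheduler; the engine assignments in the comments are illustrative), the following sketch splits a GEMM-plus-activation over row slices so that, on real hardware, the activation of slice k-1 could run on the vector engine while the matrix engine computes slice k:

import numpy as np

def sliced_gemm_relu(A, B, slices=4):
    # Software-pipelined GEMM + ReLU: while the "MME" computes the GEMM
    # for the current slice, the "TPC" applies ReLU to the previous one.
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    prev = None
    for rows in np.array_split(np.arange(A.shape[0]), slices):
        partial = A[rows] @ B              # "MME": GEMM on current slice
        if prev is not None:
            r, p = prev
            out[r] = np.maximum(p, 0)      # "TPC": ReLU on previous slice
        prev = (rows, partial)
    r, p = prev
    out[r] = np.maximum(p, 0)              # drain the final slice
    return out

A = np.random.randn(512, 256).astype(np.float32)
B = np.random.randn(256, 128).astype(np.float32)
assert np.allclose(sliced_gemm_relu(A, B), np.maximum(A @ B, 0), atol=1e-4)

The smaller the slices can be made without losing MME utilization, the tighter the overlap; this is exactly where the small-tile efficiency described above pays off.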
Gaudi2 integrates Habana’s fourth-generation Tensor Processor Core. The TPC is a general-purpose VLIW processor which is 256B-SIMD wide and supports FP32, BF16, FP16 and FP8 (both E4M3 and E5M2), in addition to INT32, INT16 and INT8 data types. As opposed to common DSPs, which require a DMA to fetch operands in and out of a local SRAM, the TPC, using advanced micro-architectural techniques, exposes a DMA-free programming model which significantly eases software development. In addition, the same advanced microarchitecture allows bubble-free execution between kernels, which effectively keeps the TPC 100% utilized on tensor processing, even for very short kernels, regardless of the location of its inputs/outputs (SRAM or HBM). Just like the MME, the TPC is also very efficient when working on small tensors.
As deep learning training is usually solved on multiple devices, the Gaudi2 Network Interface Controllers (NICs) are an essential component in the overall Habana second-generation training solution. Gaudi’s NIC is customized to fit the distribution of a DNN graph between the chips in the network (AKA scale-out). The NIC provides the compute engine with remote direct memory access (RDMA) featuring high bandwidth and low latency over a reliable connection, without any software intervention. To fit common cloud infrastructure, the NIC ports use Ethernet connectivity with an aggregated bandwidth of 2.4 Tb/s, supporting multiple port configurations. The NIC implements the RoCE v2 specification, benefiting from the commonly used Ethernet infrastructure and the reliable, low-latency RDMA of the InfiniBand protocol, while extending RoCE scalability with a flexible, time-based congestion control algorithm, enabling linear scalability over thousands of Gaudi systems.
DNN topologies tend to use collective operations extensively, and posting collective operations on multiple ports usually requires high CPU horsepower. To reduce CPU utilization, a scalable collective offload was introduced on Gaudi2, which helps make Gaudi2’s message rate more than an order of magnitude better than the competition. Gaudi’s NICs are also aligned with all other engines on the chip and can access both local and remote memory in tensor semantics.
To summarize, the Gaudi heterogeneous architecture is unique in the sense that it is highly efficient on small-tensor operations, which is an enabler for overlapping the computation and networking communication between the heterogeneous agents, in addition to freeing up significant memory capacity and bandwidth from its memory subsystem.
Graph Compiler and Runtime
The SynapseAI graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, and graph-level optimizations. The graph compiler uses the rich TPC kernel library, which contains a wide variety of performance-optimized operations (for example, elementwise, non-linear and non-GEMM operators). Given the heterogeneous nature of the Gaudi hardware (matrix math engine, TPC and DMA), the SynapseAI graph compiler enables effective utilization through parallel and pipelined execution of framework graphs.
SynapseAI uses a stream architecture to manage the concurrent execution of asynchronous tasks, supporting Gaudi’s unique combination of compute and networking and exposing a multi-stream architecture to the framework. Streams of different types – compute, networking and DMA – are synchronized with one another at high performance and with low run-time overheads.
Habana Communication Libraries
The Habana Communication Library enables efficient scale-up communication between Gaudi processors within a single node and scale-out across nodes for distributed training, leveraging Gaudi’s high-performance RDMA communication capabilities. It has an MPI look-and-feel and supports point-to-point operations (for example, Write, Send) and collective operations (for example, AllReduce, AlltoAll) that are performance-optimized for Gaudi. The Habana Collective Communications Library (HCCL) is Habana’s implementation of standard collective communication routines with an NCCL-compatible API.
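As a minimal sketch of what this looks like from PyTorch (following the pattern documented on docs.habana.ai; exact module paths may vary between SynapseAI releases), initializing torch.distributed with the HCCL backend is enough to route collectives over Gaudi’s RDMA NICs:

import torch
import torch.distributed as dist
import habana_frameworks.torch.core             # registers the 'hpu' device
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend

# Rank, world size and master address come from the usual environment vars.
dist.init_process_group(backend="hccl")

t = torch.ones(1024, device="hpu")
dist.all_reduce(t)   # AllReduce executed by HCCL over the on-chip NICs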
TPC Programming
The SynapseAI TPC SDK includes an LLVM-based TPC-C compiler, a simulator and a debugger. These tools facilitate the development of custom TPC kernels; we have used this very SDK to build the high-performance kernels provided by Habana. Users can thereby develop customized deep learning models and algorithms on Gaudi to innovate and optimize to their unique requirements.
The TPC programming language, TPC-C, is a derivative of C99 with added language data types to enable easy utilization of processor-unique SIMD capabilities. It natively supports wide vector data types to assist with programming of the SIMD engine (for example, float64, uchar256 and so on). It has many built-in instructions for deep learning, including tensor-based memory accesses, acceleration for special functions, random number generation and multiple data types.
DL Framework Integration
Habana SynapseAI integrates PyTorch and TensorFlow, two of the most popular frameworks used by data scientists and AI developers. This section provides a brief overview of the SynapseAI TensorFlow integration. It illustrates how SynapseAI does much of the mapping and optimization under the hood, while customers still enjoy the same abstraction they are accustomed to today.
The SynapseAI TensorFlow bridge receives a computational graph of the model from the TensorFlow framework and identifies the subset of the graph that can be accelerated by Gaudi. These subgraphs are encapsulated and executed optimally on Gaudi. FIGURE 5 shows an example of encapsulation performed on the TensorFlow framework graph. The yellow node is not supported on Gaudi, while blue nodes can execute on Gaudi. Subgraphs with blue nodes are identified and encapsulated. The original graph is modified to replace the subgraphs with their corresponding encapsulated nodes.
The framework runtime then executes the modified graph. Per node, a corresponding SynapseAI graph is
created and compiled. For performance optimization, the compilation recipe is cached for future use. After allocating memory, the recipe is enqueued for execution on a SynapseAI stream.
SynapseAI supports distributed training with TensorFlow using Horovod and the tf.distribute API with HPUStrategy. Mixed precision execution is available via the tf.keras.mixed_precision API or using Habana’s automated mixed precision conversion. These enable you to run mixed precision training without extensive modifications to existing FP32 model scripts. More details are available in the TensorFlow section on docs.habana.ai.
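A minimal sketch of the distributed TensorFlow flow described above (the import paths follow Habana’s documentation for SynapseAI 1.x and may differ from the older layout used in the porting example later in this paper):

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()        # register the HPU device with TensorFlow
strategy = HPUStrategy()    # tf.distribute strategy for Gaudi

with strategy.scope():      # variables are created per-replica on HPUs
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer="sgd", loss=loss)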
The SynapseAI PyTorch bridge interfaces between the framework and the SynapseAI software stack to train PyTorch-based deep learning models on Gaudi. We support two modes of execution: (1) Eager mode, which performs operator-by-operator execution as defined in standard PyTorch eager mode scripts, and (2) Lazy mode, which performs deferred execution of graphs comprising a collection of operators. Lazy mode provides a user experience like Eager mode, while enabling high performance on Gaudi. By default, we enable lazy mode execution. Instead of executing one operator at a time, the SynapseAI bridge internally accumulates these operators in a graph. The execution of the accumulated operators in the graph is triggered in a “lazy” manner, only when a tensor value is required by the user. This allows the bridge to construct a graph, which gives the SynapseAI graph compiler the opportunity to optimize device execution for the operators.
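To make the lazy-mode behavior concrete, here is a minimal sketch of a training step following the pattern in Habana’s PyTorch documentation, where htcore.mark_step() marks the graph boundary (module paths may vary by release):

import torch
import habana_frameworks.torch.core as htcore   # Habana PyTorch bridge

device = torch.device("hpu")
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(128, 784, device=device)
y = torch.randint(0, 10, (128,), device=device)

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()       # so far, ops are only accumulated into a graph
htcore.mark_step()    # trigger compilation and execution of the graph
optimizer.step()
htcore.mark_step()
print(loss.item())    # reading a tensor value also forces execution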
Mixed precision execution is available via the Habana Mixed Precision (HMP) package. The HMP package automatically modifies the Python operators to add the appropriate cast operations, which enables you to run mixed precision training without extensive modifications to existing FP32 model scripts.
The SynapseAI PyTorch bridge supports distributed training using the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism. Distributed communication is enabled using the HCCL backend. For more details, check out the PyTorch section on docs.habana.ai.
SynapseAI is also integrated with TensorBoard to enable debugging and profiling of your TensorFlow or PyTorch models. Users interested in low-level profiling can refer to the SynapseAI Profiler User Guide on docs.habana.ai.
The Resources section contains a collection of documents, short videos and hands-on Jupyter notebook tutorials to help you get started and run models on Gaudi. And for IT and systems administrators building Gaudi-based systems on premises, we provide guidance on the set-up and management of Gaudi servers and computing infrastructure.
The Habana GitHub contains repositories open to the general public, which include setup and install instructions for Habana binaries and docker creation, Jupyter notebook-based tutorials, reference models, a custom TPC kernel example, and more.
Containers can be deployed easily and consistently, regardless of whether the target environment is a private data center or the public cloud. The Gaudi-optimized framework containers are delivered with all necessary dependencies, including the complete SynapseAI software. Orchestration is supported via Kubernetes.
The Habana Community Forum is a dynamic resource for the developer community
to access answers to their questions when implementing or managing Gaudi-based
systems, and to share their own insights and perspectives with others who are
working with Habana and Gaudi. We invite you to join the Forum and help build a
robust and vital community of AI thought-leaders and builders who seek to leverage
the unprecedented benefits of Habana’s AI processors.
V. MODEL MIGRATION
Switching from a familiar DL platform and workflow to a new one takes effort. Our
goal is to minimize this effort and lower the barriers wherever possible. We expect
most users will be able to take existing models, apply minor changes to existing scripts, and run them on Gaudi. The Habana GitHub will contain migration guides and examples to assist users with porting their current models to run on Gaudi. More
information on migrating models to Gaudi is available in the Migration Guide on
docs.habana.ai.
The SynapseAI TensorFlow and PyTorch user guides provide an overview of SynapseAI
integration with respective frameworks, APIs and operators that are supported, how
to enable mixed precision and distributed training, and more. The migration guides help users develop a better understanding of how to port their existing models to Gaudi and provide practical tips to assist in their effort.
Below we show the minimum set of changes required to port a TensorFlow Keras
model that does not contain any custom kernels.
import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module
load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)
The minimal changes required to enable training on the Habana Gaudi device are highlighted in bold. All you need to do is import the load_habana_module package and then invoke the load_habana_module() function to enable training on Gaudi. With this change, the Gaudi device, which is referred to as HPU in the framework, is registered in TensorFlow and prioritized for execution over the CPU. When an operator is available for both CPU and HPU, the operator is assigned to the HPU; when it is not supported on Gaudi, it runs on the CPU. For more details on porting your TensorFlow model to Gaudi processors, check out the TensorFlow Migration Guide.
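If you want to verify where operators actually land, TensorFlow’s standard device-placement logging works here as well; a small sketch (HPU is the device name registered by the Habana module, as noted above):

import tensorflow as tf
from TensorFlow.common.library_loader import load_habana_module

load_habana_module()
tf.debugging.set_log_device_placement(True)   # log each op's device

a = tf.random.uniform((4, 4))
b = tf.random.uniform((4, 4))
print(a @ b)   # the log should show MatMul assigned to the HPU device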
VI. ECOSYSTEM PARTNERSHIPS
The AI software ecosystem is rapidly expanding with research breakthroughs being
quickly integrated into popular software packages in a scalable and hardware
agnostic fashion. Data scientists and AI developers are adopting these software
solutions to help them focus more on the data science and research, and less on
managing the complexities of underlying software engineering. At Habana, our aim
is to meet the developers where they are. We have been busy collaborating with AI
software ecosystem partners to enable a seamless user experience with Habana AI
processors.
There are two main classes one needs to know: (1) the GaudiTrainer class, which takes care of compiling (lazy or eager mode) and distributing the model to run on HPUs, and of performing training and evaluation; and (2) the GaudiConfig class, used to configure Habana Mixed Precision and to decide whether optimized operators and optimizers should be used. The GaudiTrainer is very similar to the Transformers Trainer, and adapting a script that uses the Trainer to make it work with Gaudi will mostly consist of simply swapping the Trainer class for the GaudiTrainer one. The example below shows how simple it is to get started with training Transformer models on Gaudi. Several popular reference models are available on the Hugging Face Habana page, including bert-base, bert-large, roberta-base, roberta-large, distilbert-base, albert-large and albert-xxlarge.
# Loading the GaudiConfig needed by the GaudiTrainer to fine-tune the model on HPUs
gaudi_config = GaudiConfig.from_pretrained(
    training_args.gaudi_config_name,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)

# The training arguments differ a bit from the original ones, which is why we
# use GaudiTrainingArguments
trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
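Once constructed, the GaudiTrainer is driven exactly like the standard Transformers Trainer (a sketch; the datasets and the objects above come from the usual fine-tuning script):

train_result = trainer.train()   # fine-tune on the HPU devices
metrics = trainer.evaluate()     # run evaluation
trainer.save_model()             # persist the fine-tuned model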
Below you will see how easy it is to get started with training on Gaudi using PyTorch Lightning.

import pytorch_lightning as pl

All you need to do is provide the accelerator="hpu" parameter to the Trainer class and select the number of Gaudi processors by setting the devices parameter. For mixed precision training, import the HPUPrecisionPlugin and set precision=16.
With cnvrg.io, data scientists can deploy more models with drag-and-drop machine learning pipelines. You can easily run and track experiments, and automate your machine learning from research to production using reusable components and a drag-and-drop interface. Getting started with Habana Gaudi on cnvrg.io first requires setting up a Kubernetes cluster for your on-premises Gaudi servers or an Amazon EKS cluster using DL1 EC2 instances. cnvrg.io seamlessly integrates both on-premises and cloud compute resources. The Habana Vault, which hosts the SynapseAI TensorFlow and PyTorch Docker container images, is integrated and available in cnvrg.io Registries. You can now bring up a new Jupyter Workspace and select the appropriate Gaudi compute and Docker image from the cnvrg.io Habana container registry. You can then get started with the Habana reference models by simply adding the repo location in the cnvrg Project Settings Git Integration page. Now you can start a new Experiment in cnvrg.io and begin training your model on Gaudi.
Habana has worked with server, switch, and storage system partners to make it easy for end customers to build AI racks and clusters.
The figure below shows a rack-scale configuration with four Gaudi servers
connected to a single Ethernet switch at the top of the rack. This switch can be
further connected to other racks to form a much larger training pod that can hold
hundreds or thousands of Gaudi processors.
The DDN A3I scalable architecture integrates X12 Gaudi AI servers with DDN AI shared parallel file storage appliances and delivers fully optimized end-to-end AI acceleration on Habana Gaudi AI processors. DDN A3I solutions greatly simplify the deployment of X12 Gaudi AI servers in single-server and multi-server configurations, while also delivering the performance and efficiency needed to keep Habana Gaudi AI processors saturated, along with high levels of scalability.
This section describes the components integrated in DDN A3I Solutions with
Supermicro X12 Gaudi AI servers.
As general guidance, DDN recommends an AI400X2 appliance for every four X12
Gaudi AI servers. These configurations can be adjusted and scaled easily to match
specific workload requirements. An overview of the network architecture is shown
in the figure below.
[Figure: network architecture overview – management network and high-performance networks]
FIGURE 8: DDN A3I REFERENCE ARCHITECTURE WITH FOUR X12 GAUDI AI SERVERS – HIGH-PERFORMANCE NETWORKS
Figure 8 illustrates the DDN A3I architecture in a quad-server configuration. Four X12 Gaudi AI
servers are connected to an AI400X2 appliance through a network switch. Every X12 Gaudi AI server connects to the storage & cluster management network switch via two 100 GbE links. The AI400X2 appliance connects to the storage & cluster management network switches via eight 100 GbE links. This ensures non-blocking data communication between every device connected to the network. The multi-path design provides full redundancy and maximum data availability in case a link becomes unavailable.
Additionally, the X12 Gaudi AI servers are connected through a network switch for
Gaudi communication. Every X12 Gaudi AI server connects to the Gaudi network
switch via six 400 GbE links.
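As a quick sanity check on the non-blocking claim, a sketch using only the link counts quoted above:

# Link counts from the DDN A3I reference architecture above.
servers = 4
per_server_storage_gbe = 2 * 100    # two 100 GbE links per X12 server
appliance_gbe = 8 * 100             # eight 100 GbE links on the AI400X2

demand = servers * per_server_storage_gbe   # 4 * 200 = 800 GbE
assert demand <= appliance_gbe              # appliance side not oversubscribed

# Gaudi-side scale-out: six 400 GbE links per server.
gaudi_gbe = 6 * 400                 # 2400 GbE = 2.4 Tb/s per X12 server
print(demand, appliance_gbe, gaudi_gbe)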
The baseboard has standard interfaces/connectors to the HIB (Host Interface Board), which gives system designers the ability to customize the design to specific needs and the flexibility to build systems of choice with different ratios of CPUs to accelerators for different kinds of topologies and applications.
Feature / Description

OAM support
• All-to-all connectivity for 8 Gaudi2 HL-225H cards
• OAM powered by 54V and 12V
• x16 PCIe Gen4 host interface per OAM

Baseboard to HIB (Host Interface Board) Interface
• 8x dual B2B connectors
• 8x x16 PCIe Gen4 connectors
• Power: 12V, 54V
• Side-band signals: I2C, Reset, reference clocks, JTAG, UART, SGMII, USB
• Eight Amphenol connectors: 2x 160P (10131762-101LF) + 6x 112P (10137002-101LF)

Networking: Card to Card & Scale-out
• Per OAM: 24x 100GbE PAM4 SerDes links, split into:
  – 21x 100GbE for OAM-to-OAM connections
  – 3x 100GbE for scale-out
• Total scale-out: 8 x 3 x 100GbE = 2.4TbE, connected to 6 QSFP-DD ports

PCB dimension
• 585mm x 417mm x 4.6mm
[Figure: HLBA-225 baseboard block diagram – eight HL-225H OAMs in a fully connected topology via 21x 100G RoCE links per OAM, 3x 100G RoCE scale-out links per OAM routed through PAM4 retimers to QSFP-DD ports, and x16 PCIe Gen4 retimers connecting the OAMs to the HIF connectors]
X. HLS-GAUDI®2 Server
The HLS-Gaudi®2 system is a high-performance deep-learning server incorporating a dual-socket Xeon host subsystem and eight Gaudi2 accelerators, and it supports scale-out using 24x 100GbE RDMA ports.
Feature / Description

System dimension
• 19”

CPU head node
• 2x Intel Xeon Ice Lake CPUs
• 32x DDR4 DIMMs
• 2x NICs

HIB
• 2x PCIe switches
• BMC + peripherals

Base board
• HLBA-225
• Fully connected topology
• 6x QSFP-DD connectors (4x400G using 56G PAM4 SerDes)