
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Industrial Product*

Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay
Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson
Google, Mountain View, CA
{jouppi,gkurian,lsheng,pcma,rahulnagarajan,lnai,nishantpatil,suvinay,aswing,btowles,cliffy,zhoux,zongweiz}@google.com
and [email protected]

ABSTRACT

In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x–7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus nearly 10x faster overall, which along with OCS flexibility and availability allows a large language model to train at an average of ~60% of peak FLOPS/second. For similar sized systems, it is ~4.3x–4.5x faster than the Graphcore IPU Bow and is 1.2x–1.7x faster and uses 1.3x–1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2-6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centers.

CCS CONCEPTS
• Computer systems organization → Architectures → Other architectures → Neural networks

KEYWORDS
Machine learning, domain specific architecture, TPU, GPU, IPU, supercomputer, optical interconnect, reconfigurable, embeddings, large language model, power usage effectiveness, warehouse scale computer, carbon emissions, energy, CO2 equivalent emissions

ACM Reference Format:
Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, David Patterson. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings: Industrial Product. In The 50th Annual International Symposium on Computer Architecture (ISCA '23), June 17–21, 2023, Orlando, FL, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3579371.3589350.

______________
*This paper is part of the Industry Track of ISCA 2023's program.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
ISCA '23, June 17–21, 2023, Orlando, FL, USA
© 2023 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0095-8/23/06.
https://doi.org/10.1145/3579371.3589350

1 INTRODUCTION

Happily for architects, machine learning (ML) models continue to evolve in challenging ways, both in scale and algorithmically (see Table 1 and Section 7.7). Examples of the former are large language models (LLMs) and examples of the latter are the embeddings necessary for recommender systems (deep learning recommendation models or DLRMs) and the huge calculations of Transformers and BERT. The incredible scale of recent LLMs [6, 38, 54] has stretched our ML supercomputer scale from 256 TPU v2 nodes to 4096 TPU v4 nodes. Reaching such a scale raises reliability problems that are particularly compounded by the HPC-style, checkpoint/restore, everything-must-work way that deep neural network (DNN) training is performed. It is far from the software reliability typical of mainline Google distributed systems.

This paper describes three major features of TPU v4 that respond to these challenges:
1. We addressed the scale and reliability obstacles by introducing Optical Circuit Switches (OCSes) with optical data links, allowing a 4K-node supercomputer through reconfiguration to tolerate 1K CPU hosts that are unavailable 0.1%–1.0% of the time.
2. We disclose the hardware support for embeddings in DLRMs (SparseCore or SC), part of TPUs since TPU v2.
3. Combining the first two capabilities, embeddings add all-to-all communication patterns to the demands on supercomputer-scale interconnect. Unlike all-reduce used in backpropagation, which maps well to 2D and 3D tori, all-to-all patterns strain bisection bandwidth. OCSes enable flexible topology configuration, including the twisted torus [7], which has better bisection properties.

LLMs are a hot topic today in ML circles. While scale and reliability originally motivated OCSes in TPU v4, their topological flexibility and deployment advantages turned out to improve LLM training time significantly.

Since prior papers have described the fundamentals of previous TPUs for training [26, 39] and for inference [25, 27], this paper focuses on the three novel features of TPU v4 listed above that have not yet been described.
The major contributions of the paper are:
● It describes and evaluates the first production deployment of OCSes in a supercomputer and the first to allow topology reconfiguration to improve performance.
● It describes and evaluates the first accelerator support for embeddings in a commercial ML system.
● It documents the rapid change in production model types since 2016 for the fast changing ML field (Table 1).
● It shows how Google uses ML to co-optimize DNN models, OCS topology, and the SparseCore.
The next section introduces OCSes and explains their many benefits. Section 3 motivates the SparseCore and shows its performance gains. Section 4 uses ML to search how to co-optimize the hardware and DNN models. The next two sections compare performance on production workloads versus TPU v3 and then versus the Nvidia A100 and the Graphcore MK2 IPU using MLPerf. The paper ends with a discussion, a related work section, and a summary.

Table 1: Workloads by DNN model type (% TPUs used). Over 90% of training at Google is on TPUs. The parenthesized entries split Transformer models into the subtypes of BERT and LLM. Columns 2 to 4 show workloads for inference [25], training and inference [26], and inference [27]. The last workload is for training on TPU v4s over 30 days in October 2022.

DNN Model   | TPU v1 7/2016 (Inference) | TPU v3 4/2019 (Training & Inference) | TPU v4 Lite 2/2020 (Inference) | TPU v4 10/2022 (Training)
MLP/DLRM    | 61%  | 27%  | 25%   | 24%
RNN         | 29%  | 21%  | 29%   | 2%
CNN         | 5%   | 24%  | 18%   | 12%
Transformer | --   | 21%  | 28%   | 57%
(BERT)      | --   | --   | (28%) | (26%)
(LLM)       | --   | --   | --    | (31%)

2 RECONFIGURABLE OPTICAL SWITCH

We wanted to scale up the number of chips by 4x versus TPU v3, just as TPU v3 was 4x TPU v2. Given the distance between TPU v3 racks, some wrap-around links of its 2D torus topology were so long that they had to be optical due to the reach limitation of electrical interconnects. Optical links are >10x more expensive than electrical links. At 4x the scale, there would be even more optical links. Moreover, there were concerns about the bisection bandwidth of a 2D torus of that size and the availability of a single system of that scale. Using a 3D torus increases bisection bandwidth and the OCS acts like a plugboard to skip failed units.

2.1 Optical Circuit Switching
To improve data center networking, Google advanced the state-of-the-art in reliability and cost of optical transceivers and OCSes [43, 54]. The resulting Google Palomar OCS is based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds. They employ circulators to send light both ways in a fiber, halving the number of required ports and cables.
What size of the electrically-cabled building block to use? Given the 3D torus, 3D cubes have the best bisection bandwidth, suggesting 4×4×4 (64 chips) or 8×8×8 (512). With 4 TPU v4s per CPU host, 64 TPU v4 chips and their 16 CPU hosts comfortably fit into one rack. As 512 chips need multiple racks, a 4³ building block was chosen.

2.2 Construction of the TPU v4 Supercomputer
Figure 1 shows the links from the 6 "faces" of a 4³ block. There are 16 links per face, totaling 96 optical links per block that connect to OCSes. To provide the wraparound links of a 3D torus, the links on the opposing sides must connect to the same OCS. Thus, each 4³ block connects to 6 × 16 ÷ 2 = 48 OCSes. The Palomar OCS is 136×136 (128 ports plus 8 spares for link testing and repairs), so 48 OCSes connect the 48 pairs of cables from 64 4³ blocks (each 64 chips), yielding the desired total of 4096 TPU v4 chips.

Figure 1: Connectivity of a 4×4×4 cube (top) to 3 OCSes (bottom). The "+" and "–" connections with the same dimension and index are connected to the same OCS; 48 of these in-out pairs each connect to a distinct OCS.

Figure 2 below shows a TPU v4 package and four of them mounted on the printed circuit board. Like TPU v3, each TPU v4 contains two TensorCores (TC). Each TC contains four 128x128 Matrix Multiply Units (MXUs) and a Vector Processing Unit (VPU) with 128 lanes (16 ALUs per lane) and a 16 MiB Vector Memory (VMEM). The two TCs share a 128 MiB Common Memory (CMEM). The PCB embeds 4 Inter-Core Interconnect (ICI) links, connected as a 2×2 mesh; 16 external ICI links go to other trays for constructing the 3D torus. Figure 3 below shows one row of eight racks, where each rack contains 16 tray-host server pairs. Passive electrical cables create a 4×4×4 3D mesh in a rack. Electrical-to-optical conversions happen at the fiber connector to the TPU trays. There are no other conversions until the light reaches an optical-to-electrical converter at the fiber connector of the destination tray. The 48 OCSes join eight rows together to form the complete 64-rack system.
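The cabling arithmetic of Section 2.2 can be checked with a short calculation. The sketch below only restates the counts given in the text (it is illustrative, not production tooling):

    # Sketch of the TPU v4 cabling arithmetic from Section 2.2 (illustrative only).
    FACES_PER_BLOCK = 6        # a 4x4x4 block has 6 faces
    LINKS_PER_FACE = 16        # 4x4 links leave each face
    CHIPS_PER_BLOCK = 4 ** 3   # 64 chips per block
    BLOCKS = 64                # 64 blocks per supercomputer

    optical_links_per_block = FACES_PER_BLOCK * LINKS_PER_FACE       # 96
    # Opposing faces must reach the same OCS to close the torus wraparound,
    # so each in/out pair of faces consumes one OCS per link index.
    ocses_per_block = FACES_PER_BLOCK * LINKS_PER_FACE // 2          # 48
    total_chips = CHIPS_PER_BLOCK * BLOCKS                           # 4096
    # Each of the 48 OCSes sees one port pair per block; 64 blocks use 128 of
    # the 136 Palomar ports, leaving 8 spares for link testing and repairs.
    ports_used_per_ocs = 2 * BLOCKS                                  # 128

    print(optical_links_per_block, ocses_per_block, total_chips, ports_used_per_ocs)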


Figure 2: The TPU v4 package (ASIC in center plus 4 HBM stacks) and printed circuit board with 4 liquid-cooled packages. The board's front panel has 4 top-side PCIe connectors and 16 bottom-side OSFP connectors for inter-tray ICI links.

Figure 3: Eight of 64 racks for one 4096-chip supercomputer.

2.3 OCS Availability Benefits
An OCS raises availability by routing around failures. The main problem is the CPU host; each host has 4 TPU v4s, so 1K hosts per supercomputer. Like HPC supercomputers, the workload consists of a range of scale sizes, called slices: 64 chips, 128 chips, 256 chips, and so on. Figure 4 shows the "goodput" of slice sizes as host availability varies from 99.0% to 99.9% with and without OCSes. Without OCSes, host availability must be 99.9% to offer reasonable slice goodput. OCSes also have fair goodput for 99.0% and 99.5% for most slice sizes. Figure 4 assumes all slice size requests are equal, but workloads have many sizes (Table 2).

Figure 4: Impact of OCS connected versus a statically connected supercomputer on goodput (i.e., effective throughput) as CPU availability and slice size varies on a log scale. Goodput is counterintuitive at large slices. At ¼ of the 4K chips, goodput for both 99.0% and 99.5% is 75%, as 3 slices occupy ¾ of the chips. Spares are needed to allow scheduling jobs despite some failed nodes, so you can't realistically schedule two 2K node slices from 4K nodes. With one 2K node slice (50% of 4K), you have 50% spares, so it will have 50% goodput. With 3K nodes (75% of 4K), you have 25% spares, and therefore 75% goodput.

Table 2: Sampling of popularity of TPU v4 slices for a day in November 2022. This table includes all slices used ≥ 0.1%. Twistable (Section 2.8) but not twisted means the slice geometry allows twisting (n×n×2n or n×2n×2n), but the user picks the regular topology. The software scheduler requires that slices have dimensions x ≤ y ≤ z. Half of the slices have x, y, and z as either 4 or 8.

Chips <64:
  Regular Tori: 1×1×1 (1) 2.1% | 1×1×2 (2) 0.4% | 1×2×2 (4) 6.7% | 2×2×2 (8) 4.7% | 2×2×4 (16) 6.4% | 2×4×4 (32) 8.9%   Total: 29%
Chips 64:
  Regular Tori: 4×4×4 (64) 13.9%   Total: 14%
Chips 128-192:
  Twisted Tori: 4×4×8_T (128) 16.0%
  Twistable, not twisted Tori: 4×4×8_NT (128) 1.5%
  Regular Tori: 4×4×12 (192) 0.7%   Total: 18%
Chips 256-384:
  Twisted Tori: 4×8×8_T (256) 9.2%
  Twistable, not twisted Tori: 4×8×8_NT (256) 1.5%
  Regular Tori: 4×4×16 (256) 1.0% | 4×8×12 (384) 0.1%   Total: 12%
Chips 512-768:
  Regular Tori: 8×8×8 (512) 9.6% | 4×8×16 (512) 1.7% | 4×4×32 (512) 0.6% | 8×8×12 (768) 0.7%   Total: 13%
Chips 1024-1536:
  Twisted Tori: 8×8×16_T (1K) 1.8%
  Twistable, not twisted Tori: 8×8×16_NT (1K) 1.4%
  Regular Tori: 4×16×16 (1K) 0.3% | 4×4×64 (1K) 0.1% | 4×8×32 (1K) 0.1% | 8×12×16 (1.5K) 0.1% | 4×4×96 (1.5K) 0.1% | 8×8×24 (1.5K) 0.1%   Total: 4%
Chips 2048-3072:
  Twisted Tori: 8×16×16_T (2K) 1.4%
  Twistable, not twisted Tori: 8×16×16_NT (2K) 0.3%
  Regular Tori: 12×16×16 (3K) 5.7% | 4×4×192 (3K) 0.4%   Total: 8%

2.4 OCS Deployment Benefits
The OCSes also shrank deployment time. TPU v3 systems were not usable until all 1024 chips and all cables were installed and tested. Delivery delays for any component held up the entire supercomputer. For TPU v4, OCSes made each rack independent, so each 4³ block was put into production as soon as 64 chips and the necessary cables were installed and tested. Incremental deployment greatly improved the time to production use and thus cost effectiveness of the TPU v4 supercomputers.
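A minimal sketch of the spare-capacity reasoning in the Figure 4 caption (Section 2.3) follows. The goodput helper and its leave-at-least-some-spares rule are illustrative simplifications of the scheduler behavior described above, not its actual policy:

    # Sketch of the spare-capacity arithmetic behind Figure 4 (Section 2.3).
    # Simplifying assumption: only whole slices count as useful work, and you
    # can never commit every chip because spares are needed for failed nodes.
    TOTAL_CHIPS = 4096

    def goodput(slice_chips: int) -> float:
        schedulable_slices = (TOTAL_CHIPS - 1) // slice_chips   # leave spares
        return schedulable_slices * slice_chips / TOTAL_CHIPS

    print(goodput(1024))   # 0.75: three 1K slices fit, the fourth 1K acts as spares
    print(goodput(2048))   # 0.50: one 2K slice fits, 50% of the machine is spares
    print(goodput(3072))   # 0.75: one 3K slice fits, 25% spares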


2.5 OCS Scheduling Benefits

The OCS also simplifies scheduling, which increases utilization. For TPU v3, a 256 chip slice meant the scheduler had to find 256 contiguous chips that were idle. For TPU v4, it can pick four 4³ blocks from anywhere in the supercomputer. Slices don't even need to be a power of 2; they can be 4i×4j×4k, where 0 < i ≤ j ≤ k. For example, a user could request a 192 TPU v4 slice with a geometry of 4×4×12.
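The slice shapes allowed by this rule are easy to enumerate. The sketch below applies only the 4i×4j×4k, i ≤ j ≤ k constraint from Section 2.5 and ignores any additional scheduler policies, so it is illustrative rather than authoritative:

    # Enumerate TPU v4 slice geometries of the form 4i x 4j x 4k with 0 < i <= j <= k
    # (Section 2.5). Real scheduling has more constraints; this is only the shape rule.
    def slice_shapes(max_chips: int = 4096):
        shapes = []
        i = 1
        while (4 * i) ** 3 <= max_chips:
            j = i
            while 4 * i * 4 * j * 4 * j <= max_chips:
                k = j
                while 4 * i * 4 * j * 4 * k <= max_chips:
                    shapes.append((4 * i, 4 * j, 4 * k))
                    k += 1
                j += 1
            i += 1
        return shapes

    shapes = slice_shapes()
    print((4, 4, 12) in shapes)                                    # True: the 192-chip example above
    print(len([s for s in shapes if s[0] * s[1] * s[2] == 512]))   # 3: 8x8x8, 4x8x16, 4x4x32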
2.6 OCS Modularity and Security Benefits

Since the OCS can switch circuits in milliseconds, TPU v4 can easily change topology to match the application, the number of nodes, and the system that runs those jobs. TPU v4 provides wraparound links of a 3D torus for most slice sizes, which doubles both the bisection bandwidth and the bandwidth of important collective communication operations (e.g., all-reduce) versus the mesh-like alternative [12], yet still allowing the TPU v4 to scale interconnect bandwidth up to 16³ (4096) chips. The OCS also enables air-gapped network isolation between different slices, which enhances the security of multiple customers sharing a TPU v4 supercomputer.

2.7 Tailoring OCS Topology to Improve Performance

The final benefit was a bonus beyond solving problems of large scale. To set the stage, here are the three fundamental types of parallelism that improve the training time of DNNs:
1. Data Parallelism: Each chip computes the forward and backward pass on a subset of examples, and sends the gradients that it calculates for its subset to the other chips.
2. Model (or Tensor) Parallelism: Large tensor operations and their weights are divided across multiple chips, so that each chip simultaneously computes a subset of a tensor operation.
3. Pipeline Parallelism: For a DNN with many layers, each chip computes a subset of layers, and communicates the layer results to chips holding the adjacent layers.
Users can change the TPU v4 topology to match the type of parallelism being used, e.g., for a 512 slice, pipeline parallelism might want a cigar shape (4×4×32) instead of the conventional 8³ cube (8×8×8). For the highest bisection bandwidth, often needed by embedding heavy applications, the conventional 8³ cube is preferred. ML practitioners often combine parallelism types to get the best performance, such as data plus model parallelism. Model parallelism typically has two parameters: width and length. To take full advantage of the bandwidth available, users map data parallelism along one dimension of the 3D torus and the two model parallel parameters on the other dimensions. Table 3 in Section 4 gives examples of performance gains of 1.2x to 2.3x by varying the topology and the hyperparameters.
TPU v4 uses a single, static topology for each training job, which can be co-optimized with the communication requirements of the training job. This per-job configuration is not a fundamental limitation of the OCS. See [55] for more details of the OCS and its physical construction.
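The dimension mapping described above can be made concrete with a toy helper. The name pick_topology and the rule of assigning each parallelism degree to one torus axis are assumptions for exposition only; the real choice also weighs bisection bandwidth and the hyperparameters, as Table 3 in Section 4 shows:

    # Illustrative mapping of parallelism degrees onto a 3D torus slice (Section 2.7).
    # Data parallelism takes one torus dimension; the two model-parallel parameters
    # (width, length) take the other two. A 512-chip job with data=4 and model=(8, 16)
    # would therefore request a 4x8x16 slice rather than the default 8x8x8 cube.
    def pick_topology(data_parallel: int, model_width: int, model_length: int):
        dims = sorted([data_parallel, model_width, model_length])  # scheduler wants x <= y <= z
        chips = dims[0] * dims[1] * dims[2]
        return tuple(dims), chips

    topology, chips = pick_topology(data_parallel=4, model_width=8, model_length=16)
    print(topology, chips)   # (4, 8, 16) 512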

2.8 Twisting the Torus


For a slice with a number of TPUs that is a perfect cube, a symmetric torus with an equal number of TPUs along each dimension minimizes latency and maximizes bisection bandwidth (e.g., 8³ = 512 TPUs organized as an 8×8×8 torus). For non-perfect-cube slices, a rectangular torus can be built with different numbers of TPUs along each dimension. Alternatively, [7] proposes a topology that outperforms rectangular tori with lower latency and higher bisection bandwidth without increasing switch hardware. The twisted torus rewires some links between 4³ cubes to reduce the worst case latency. Figure 5 shows a regular topology and a twisted topology. Since TPU v4 uses an OCS to connect 4³ blocks, the "rewiring" is mostly reprogramming of routing in the OCS. Using all-to-all communication with large messages as a microbenchmark, Figure 6 below shows the performance gain from the twisted topology. The twisted torus improves all-to-all throughput by 1.63x and 1.31x over the regular torus on 4×4×8 and 4×8×8 slices, respectively. While the popular interconnect topologies today are Clos networks and Dragonfly networks, the twisting option improves the worst case bisection bandwidth for the 3D tori and makes them more attractive for today's supercomputers.

Figure 5: Example of regular (top) and twisted torus (bottom) topologies for a 4×2 slice of TPU v4 nodes. The TPU v4 network is three-dimensional, but the figure uses two dimensions for ease of illustration. Each TPU is labeled with its coordinates in the slice. The electrical connections (red dashed lines) remain fixed. By utilizing the flexibility of the OCSes, the optical connections (blue solid lines) can be reconfigured from a rectangular torus to a twisted torus without any physical recabling of the machine; the only change is in the routing tables. TPU v4 uses a k×k×2k configuration from Camarero, Martinez, and Beivide [8].

Figure 6: Measured all-to-all throughput for 4×4×8 and 4×8×8 slices using regular and twisted tori. Measurements are steady state (large aggregate transfer size) with individual DMAs being 4 KiB. Each column also shows the theoretical delta from the ideal peak as a stacked bar and label above the measured performance.


2.9 Distribution of Topologies

Table 2 above gives a sampling of slice topologies used in production. Some tasks are smaller than a 4³ block, so they can only use a 2D mesh; 29% are smaller than a 4³ cube, so they obviously cannot select a twisted 3D torus. Of the remaining 71%, only those of the form n×n×2n or n×2n×2n (n≥4) can twist. They are 33% (48% of 71%). The actual twisted tori are 28% (86% of 33%). Stated alternatively, 40% of the topologies that are 4³ blocks or larger use twisted tori.
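These shares can be cross-checked directly against the Table 2 entries; the sketch below only re-adds the table's percentages (rounding explains the small gaps from the quoted 33%, 28%, and 40%):

    # Check of the Section 2.9 shares from the Table 2 entries (percent of all slices).
    twisted = 16.0 + 9.2 + 1.8 + 1.4               # the _T rows: 4x4x8, 4x8x8, 8x8x16, 8x16x16
    twistable_not_twisted = 1.5 + 1.5 + 1.4 + 0.3  # the _NT rows of Table 2
    twistable = twisted + twistable_not_twisted    # ~33% of all slices
    cube_or_larger = 100.0 - 29.0                  # ~71%: slices at least one 4^3 block

    print(round(twistable, 1))                     # ~33.1
    print(round(twisted, 1))                       # ~28.4
    print(round(100 * twisted / cube_or_larger))   # ~40: share of >=4^3 slices that twist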
2.10 Cost of OCS Flexibility
Remarkably, given all the benefits of OCSes, their cost is <5% of the total TPU v4 supercomputer capital costs and <3% of total power. The power and cost accounting includes the entire optical fabric, including the optics modules, fiber, and OCS infrastructure.

3 SPARSECORE: EMBEDDINGS SUPPORT

Before introducing the next TPU innovation, let's review recommendation models, embeddings, and distributed training to set the stage.
3.1 Recommendation Models

Deep learning recommendation models (DLRMs) are a quarter of our ML workload (Table 1 above). DLRMs are used in advertising, search ranking, YouTube, and Google Play applications [1, 4, 11, 13]. Google's production advertising models score ads for billions of queries daily, consist of billions of weights, train on more than one trillion examples, and are required to perform inference at well over one hundred thousand requests per second [1]. DLRM model sizes are determined using five factors: prediction quality, training time, total training cost, serving latency, and total serving cost. Embeddings are a key component of DLRMs.
3.2 Embeddings
Inputs for DLRMs consist mainly of categorical features. Each categorical feature contains N discrete values (N is commonly referred to as the vocabulary size). For example, in the search ranking application, the search query is a categorical feature, and N is the number of words in the English language. A given training example (query) is sparse, and contains a tiny subset of words. Neural networks typically train well on dense vectors. Embeddings are the standard and effective way to transform categorical feature values into dense vectors. An embedding function translates from the large, categorical space (e.g., words in the English language) to a smaller, dense space (e.g., a 100-vector representing each word). Embedding functions are implemented using lookup tables. An example is a table with 80,000 rows (one per word) of width 100. Each training example can look up either one row (univalent embedding) or a small, dynamic number of rows (multivalent embedding, typically combined by summing). A neural network model might have many tables of many sizes for different categorical features. Embeddings are a key component of Google DLRMs, and typically form the first layer in a neural network model.

3.3 Distributed Training
Embedding tables are large, and can range in size from O(10 MiB) to O(100 GiB). In aggregate, all the embedding tables in a model can be as large as several TiBs. Hence, such tables are partitioned across the memory of several TPU chips. There are three methods for partitioning: (1) column sharding splits tables along their width across multiple chips, (2) row sharding splits tables along their vocabulary size, and (3) table sharding places different tables on different chips. These distribution strategies are collectively termed model parallelism in the context of neural network models. For small embedding tables, replication across all chips (using data parallelism) is better for performance.

3.4 Key Performance Attributes
Embedding lookup operations consist mainly of small gather or scatter memory accesses, which have low arithmetic intensity. As opposed to dense operations (e.g., transformers, fully connected networks), where the chip FLOPS/second is the main driver of end-to-end performance, embedding lookup operations are bottlenecked by the memory bandwidth, memory capacity, and VPU (vector processing unit) performance. The ICI interconnection network (across chips) is also a significant performance driver. The interconnect bandwidth and performance depend on the type of parallelism being exploited. For model parallelism (the common case), the communication pattern consists of variable-length all-to-all exchange, and the network bisection bandwidth limits performance. For data parallelism, the communication pattern consists of all-reduce operations, which injection bandwidth limits.
The unstructured sparsity of embeddings is also prone to compute, memory, and communication load imbalances across a supercomputer. To reduce load imbalance, deduplication of frequent feature values is commonly used, and must be efficiently supported by the compute substrate. Deduplication also reduces the number of memory accesses and the quantity of data sent over the interconnection network, further improving performance.

3.5 SparseCore
It's time to introduce the SparseCore (SC). For the training phase, embeddings could be placed on the TensorCore or the host CPUs of the supercomputer. The TensorCore has wide VPU and matrix units, and is optimized for dense operations. Placing embeddings on the TensorCore would be suboptimal due to small gather/scatter memory accesses and variable length data exchange. Placing embeddings on the host CPUs of a supercomputer would induce an Amdahl's Law bottleneck over the CPU DRAM interface, amplified by the 4:1 TPU v4 to CPU host ratio. Tail latency and bandwidth restrictions of data-center networking would further constrain the training system.
Performance could be optimized using the total HBM capacity of a TPU supercomputer, joined by a dedicated ICI network, and with fast gather/scatter memory access support. This insight led to the codesign of the SparseCore (SC).
The SC is a domain-specific architecture for embedding training starting with TPU v2, with later improvements in TPU v3 and TPU v4. SCs are relatively inexpensive, at a total of only ~5% of the die area and ~5% of the power. SCs operate in a sea-of-cores configuration, combining supercomputer-scale HBM and ICI to create a flat, globally addressable memory space (128 TiB in TPU v4).
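To make the univalent and multivalent lookups of Section 3.2 concrete, here is a minimal NumPy sketch. The table sizes match the 80,000 × 100 example in the text, but the feature values are made up, and production lookups of course run on the SparseCore rather than NumPy:

    import numpy as np

    # Toy embedding table: 80,000 vocabulary rows, each a 100-wide dense vector
    # (the example sizes from Section 3.2).
    rng = np.random.default_rng(0)
    table = rng.normal(size=(80_000, 100)).astype(np.float32)

    # Univalent embedding: one categorical value per example -> one row per example.
    word_ids = np.array([17, 4093, 2])            # hypothetical feature values
    univalent = table[word_ids]                   # shape (3, 100)

    # Multivalent embedding: a variable number of values per example, combined by
    # summing the looked-up rows (the combiner named in the text).
    query_ids = [np.array([12, 99, 7]), np.array([5])]   # two examples, ragged
    multivalent = np.stack([table[ids].sum(axis=0) for ids in query_ids])  # (2, 100)

    print(univalent.shape, multivalent.shape)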


In contrast to all-reduces of large parameter tensors in dense training, the all-to-all transfers of smaller embedding vectors use HBM and ICI with finer-grained access patterns for scatter/gather. As separate cores, SCs allow parallelization across dense compute, SC, and ICI communications. Figure 7 shows the SC block diagram, which we consider a "dataflow" architecture because data flows from memory to a variety of directly connected specialized compute units.
The most general SC units are the 16 compute tiles (dark blue boxes in Figure 7). Each tile has an associated HBM channel and supports multiple outstanding memory accesses. Each tile has a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit (scVPU, not to be confused with the VPU of the TC in TPU v4), and a Flush Unit. The Fetch Unit reads activations and parameters from the HBM into the tile's slice of a 2.5 MiB Sparse Vector Memory (Spmem). The scVPU uses the same ALUs as the TC's VPU. The Flush Unit writes updated parameters to HBM during the backward pass. In addition, the five Cross-Channel Units (gold boxes in Figure 7) perform specific embedding operations, which their names explain. Like TPU v1, the units execute CISC-like instructions and operate on variable-length inputs, where the run-time of each instruction is data-dependent. The cross-channel units operate across all 16 banks of Spmem collectively.

Figure 7: SparseCore (SC) Hardware Architecture.

3.6 SparseCore Performance
The end-to-end embedding lookup performance is essentially proportional to the bisection bandwidth due to the all-to-all transfers of small embedding vectors. For the 2D torus used in TPU v2 and TPU v3, this bandwidth scales as N^(1/2) for N chips. The 3D torus in TPU v4 scales as N^(2/3) [12]. Figure 8 shows that the TPU v4/v3 bisection bandwidth ratio is 2–4x higher at a given chip count and accelerates embeddings by 1.1x–2.0x. At 1024 chips, SC overheads start to dominate, so bisection bandwidth is less important.
Figure 9 below shows performance of an internal production recommendation model (DLRM0, see Sections 7.8 and 7.9) across the two TPU generations for 128 chips. The standalone CPU configuration has 576 Skylake sockets (400 for learners and 176 for variable servers). The bottom two bars show TPU v4 without SC, where the embeddings are placed in CPU memory. The "Emb on CPU" bar places embeddings in CPU host memory and the "Emb on Variable Server" bar places embeddings on 64 external variable servers. TPU v3 is faster than CPUs by 9.8x. TPU v4 beats TPU v3 by 3.1x and CPUs by 30.1x. When embeddings are placed in CPU memory for TPU v4, performance drops by 5x–7x, with bottlenecks due to CPU memory bandwidth.

Figure 8: Bisection bandwidth ratio of TPU v4 to TPU v3 and performance sensitivity to bisection bandwidth. The model used is a DLRM with ~100M dense parameters in fully connected layers, ~20B embedding parameters (~300 features mapped to ~150 tables), and 1-100 average valency per feature. The global batch size is scaled proportionately to the number of chips.

Figure 9: Performance of an internal recommendation model (DLRM0) on CPUs, TPU v3, TPU v4, and TPU v4 with embeddings in CPU memory (not using SparseCore). The numbers in parentheses indicate the number of sockets (CPU or TPU) used for training.
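The scaling argument of Section 3.6 can be illustrated with a quick calculation; the constants are omitted, so only the N^(1/2) versus N^(2/3) proportionality is shown:

    # Relative bisection bandwidth growth for 2D vs 3D tori (Section 3.6).
    # For N chips, a 2D torus bisection scales ~N**0.5 and a 3D torus ~N**(2/3);
    # the printed ratio shows why the gap widens with scale, consistent with the
    # 2-4x range quoted for Figure 8 (constants omitted).
    for n_chips in (256, 1024, 4096):
        torus_2d = n_chips ** 0.5
        torus_3d = n_chips ** (2 / 3)
        print(n_chips, round(torus_3d / torus_2d, 2))   # 2.52, 3.17, 4.0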

4 USING ML TO TAILOR THE DNN TO THE TPU AND THE TPU TOPOLOGY TO THE DNN

To enable Pareto-optimizations over quality and performance for DNN models, we developed platform-aware neural architecture search (PA-NAS) at scale to tailor DNN models for TPU v4 supercomputers automatically [32]. A PA-NAS designed CNN1 achieves ~1.6X better performance (QPS and latency) than the baseline designed by generic NAS, with comparable accuracy [32]. Unlike [33], here we show how PA-NAS improved the performance of DLRM0 on TPU v4.
DLRMs are twice as popular as CNNs for our ML workloads (see Table 1 above). Unlike CNNs, which mostly use TensorCores (TCs), DLRMs use both SCs and TCs. PA-NAS can shift computation load between sparse layers (running on SCs) and dense layers (running on TCs) for Pareto-optimal performance and quality.
Figure 10 below shows PA-NAS on a production-scale DLRM model. Despite having been optimized manually and by using a generic NAS, the original DLRM0 idled the SC ~25% of the execution time (top blue bar in Figure 10) because of the load imbalance between SCs and TCs. PA-NAS enables the end-to-end Pareto-optimizations on quality and performance over both
embedding layers (running on SC) and hidden layers (running on TC) for DLRM0, which approaches perfect SC-TC load-balance (lower blue and red bars in Figure 10) and improves DLRM0 end-to-end performance by >10%. This performance uplift is equivalent to improvements historically achieved by a team of >10 experts over about half a year, further demonstrating PA-NAS's capability of increasing accelerator flexibility and performance gain.
We can also use search to tailor the TPU v4 topology to the DNN model. Table 3 shows gains in performance from searching for configurations for two models. The first example increased performance 2.3x by changing the geometry for a 512 TPU v4 slice over the novice user's initial design for an LLM. The second example shows a harder task, demonstrating an improvement of 1.2x over an expert's design for the pre-training phase of GPT-3.

Figure 10: Performance improvements by PA-NAS for DLRM0. Since DLRMs use both SCs and TCs, the maximum of SparseComputingTime and DenseComputingTime is the end-to-end training step time of a DLRM. All computing times in the figure (SparseComputingTime and DenseComputingTime for both original and optimized DLRM0) are normalized against the DenseComputingTime of the original DLRM0, since the original DLRM0 is bottlenecked on Dense Computing. Embedding sizes and FLOPs are normalized against those of the original DLRM0.

Table 3: Improvements in performance as we vary the topology of a 512 TPU v4 slice for training an LLM and the GPT-3 pre-training stage. The original LLM used model parallelism of dimensions 16×32 and no pipeline or data parallelism for a 4×8×16 topology. The revision changed model parallelism to 64×8 and the topology to 8×8×8. The "1D/2D activation/weight partitioning" option is a typical way of partitioning tensors in a large model graph (see Figure 7 in [63]). For GPT-3, the original row used a 8×8×8 topology, pipeline parallelism of depth 8, no data parallelism, and model parallelism of dimensions 8×8. The revision changed the topology to 4×8×16, pipeline depth to 16, data parallelism to 4, and the model parallelism parameters to 1×8.

Case               | Version       | Throughput (seqs/sec) | Hyper-Parameters (topology, partition spec [pipeline, data, model1, model2], 1D/2D activation/weight partitioning)
LLM                | Novice's pick | 17.9 (1.0x)           | 4×8×16, [1, 1, 16, 32], 2D/2D
LLM                | Best perf.    | 41.3 (2.3x)           | 8×8×8, [1, 1, 64, 8], 1D/2D
GPT-3 Pre-training | Expert's pick | 21.0 (1.0x)           | 8×8×8, [8, 1, 8, 8], 2D/2D
GPT-3 Pre-training | Best perf.    | 25.0 (1.2x)           | 4×8×16, [16, 4, 1, 8], 1D/1D

5 PRODUCTION WORKLOAD PERFORMANCE

Table 1 above shows the workload mix for 2022 plus the history of how it has changed over time; Section 7.7 discusses the workload changes. We use 8 applications to capture the production workload to compare TPU v3 to TPU v4. Figure 11 shows how efficiently eight production workloads scale up on TPU v4. We are encouraged that half of the workloads (CNN0, RNN0, RNN1, and BERT1) scale well to 3K chips. (Figure 4 above shows that given CPU hosts with availability of 99.0% to 99.9%, in practice it is much easier to schedule a 3K slice than a 4K slice.) For the remainder, our production teams are aware of where the scaling limits are and are constructing solutions, but those solutions are not yet implemented to allow measuring performance at full 3K scale.

Figure 11: Scalability of TPU v4 production workloads on a log-log scale. Infrastructural limitations currently hinder getting the last few data points: BERT0 scales to 2K, DLRM0/1 to 1K.

Table 4: TPU v4 and TPU v3 [26] features. Measured power is for the ASIC and HBM running production applications.

Google                      | TPU v4                                          | TPU v3
Production deployment       | 2020                                            | 2018
Peak TFLOPS                 | 275 (bf16 or int8)                              | 123 (bf16)
Clock Rate                  | 1050 MHz                                        | 940 MHz
Tech. node, Die size        | 7 nm, <600 mm2                                  | 16 nm, <700 mm2
Transistor count            | 22 billion                                      | 10 billion
Chips per CPU host          | 4                                               | 8
TDP                         | N.A.                                            | N.A.
Idle, min/mean/max power    | 90, 121/170/192 W                               | 123, 175/220/262 W
Inter Chip Interconnect     | 6 links @ 50 GB/s                               | 4 links @ 70 GB/s
Largest scale configuration | 4096 chips                                      | 1024 chips
Processor Style             | Single Instruction 2D Data                      | Single Instruction 2D Data
Processors / Chip           | 2                                               | 2
Threads / Core              | 1                                               | 1
SparseCores / Chip          | 4                                               | 2
On Chip Memory              | 128 MiB (CMEM) + 32 MiB (VMEM) + 10 MiB (spMEM) | 32 MiB (VMEM) + 5 MiB (spMEM)
Register File Size          | 0.25 MiB                                        | 0.25 MiB
HBM2 capacity, BW           | 32 GiB, 1200 GB/s                               | 32 GiB, 900 GB/s


Table 4 compares key features of TPU v3 and TPU v4. Manufactured in 7 nm instead of 16 nm, TPU v4 has twice the matrix multipliers (enabled by the increased process density) and an 11% faster clock—this drives the 2.2X gain in peak performance. About 40% of the performance/Watt improvement was from technology and the rest was from design improvements (e.g., balancing the pipeline, implementing clock gating). The HBM memory bandwidth is 1.3x higher. Depending on the slice size, the bisection bandwidth of TPU v4 is 2x–4x (see Figure 8 above). It also has the 128 MB on-chip CMEM scratchpad memory not found in TPU v3.
Figure 12 shows how much faster TPU v4 supercomputers are than TPU v3 supercomputers at the same slice size. Given the comparisons in Table 4, it's not surprising that at the same slice size most applications run 1.5x-2.0x faster on TPU v4 than on TPU v3. DLRM0 is 3.0-3.5x faster and DLRM1 is 2.8x at 512 chips, as TPU v4 has twice as many SCs and their clock rate is faster. The surprise is RNN1; it runs 3.3x faster on TPU v4. RNN1's small weights and small batch size benefit significantly from CMEM bandwidth versus HBM.
Figure 13 shows results with CMEM turned off on TPU v4; it contributes to 1.2x performance gain overall but 2x for RNN1. It also shows that TPU v4 has 2.1x the performance and 2.7x the performance/Watt of TPU v3; as mentioned above, ~40% of the gain was from the technology and the rest from design.
LLM training will become a benchmark in a future MLPerf release. We omit performance of internal LLMs—31% of the workload in Table 1—on TPU v3 because it is unoptimized. TPU v3's 2D fixed topology hinders high-performance model partitioning needed for LLMs. We also lack TPU v3 chip capacity to train large LLMs within reasonable time-to-convergence SLOs given their lower FLOPS/second and suboptimal model partitioning performance.

Figure 12: Speedup of TPU v4 vs v3 for the same slice sizes.

Figure 13: Per chip performance (top) and package-level performance/Watt (bottom) for production applications relative to TPU v3 for CMEM turned on and off for smaller systems (e.g., 32 chips). DLRM1 is much faster at 512 chips. DLRMs here are different from MLPerf DLRM (see Section 7.9).

Table 5: Features of the two DSAs [21, 40] reporting MLPerf 2.0 Training results besides TPU v4. The A100 has 32×108 = 3456 threads and the IPU has 6×1472 = 8832 threads.

Feature                                | Nvidia A100                         | Graphcore MK2 IPU
Production deployment                  | 2020                                | 2021
Peak TFLOPS                            | 312 (bf16), 624 (i8)                | 250 (bf16)
Clock Rate Base/Boost                  | 1095 / 1410 MHz                     | 1850 MHz
Tech. node, Die size                   | 7 nm, 826 mm2                       | 7 nm, 832 mm2
Transistor count                       | 54 billion                          | 59 billion
Chips per CPU host                     | 4                                   | 4
TDP                                    | 400 W                               | 300 W
Inter Chip Interconnect                | 12 links @ 25 GB/s                  | 3 links @ 64 GB/s
Largest scale MLPerf 2.0 configuration | 4216 chips                          | 256 chips
Processor Style                        | Single Instruction Multiple Threads | Multiple Instruction Multiple Data
Processors / Chip                      | 108                                 | 1472
Threads / Core                         | 32                                  | 6
On Chip Memory                         | 40 MiB                              | 900 MiB
Register File Size                     | 27 MiB                              | 1.40 MiB
HBM2 capacity, BW                      | 80 GiB, 2039 GB/s                   | 0

6 MLPERF BENCHMARK PERFORMANCE

This section compares DSAs based on the already published results for MLPerf Training [36]. Table 5 compares the features of two of the entries: the NVIDIA A100 and Graphcore MK2 IPU. While the top six rows are quite similar to TPU v4 in Table 4, the remaining rows show the diverse choices of the architects in terms of processor style, number of processors per chip, number of threads per processor, register file size, and on-chip vs. off-chip memory. First, TPU v4 has only 2 threads, while the A100 has 32×108 = 3456 threads and the IPU has 6×1472 = 8832 threads. The striking features of the IPU Bow are the ~1500 cores, 900 MB of on-chip SRAM, and zero attached HBM or other DRAM. The striking features of the A100 are a 27 MB register file (to support multithreading) and only 40 MB of on-chip SRAM. Both chips use full-reticle dies, making them ~40% larger than the TPU v4 die.
Figure 14 below shows the fastest performance per DSA for five MLPerf benchmarks. (Graphcore ran two of the five.) Vendors are free to pick the size of the system for which they want to report results. Ideally, MLPerf would benchmark systems of equal size or cost or power, but that is not required.
Figure 15 below shows the reported results for ResNet and BERT as large points while the dashed lines between the points are
interpolations based on the number of chips. The published MLPerf results for TPU v4 and A100 both scale to much larger systems than the IPU (4096 vs 256 chips). For similar sized systems, TPU v4 is 1.15x faster for BERT than the A100 and ~4.3x faster than the IPU. For ResNet, TPU v4 is 1.67x and ~4.5x faster, respectively.
Table 6 shows the power we measured running MLPerf benchmarks; A100s use on average 1.3x–1.9x more power.

Figure 14: Reported MLPerf Training 2.0 highest performance [36] relative to A100. Each column labels the number of chips per system. Graphcore submitted results for BERT and ResNet. TPU v4 DLRM is in the research category. The MLPerf DLRM is not representative of production DLRMs [64] (see Section 7.9).

Figure 15: Reported MLPerf Training 2.0 performance for BERT (top) and ResNet (bottom) [36] relative to an 8-way A100 GPU on a log-log scale. To include IPUs, we only show BERT and ResNet in this figure. At the largest scale of 4096 chips, TPU v4 is 1.15x as fast as the Nvidia A100 for BERT. At 256 chips, the maximum IPU size in MLPerf, TPU v4 is ~4.3x as fast as the MK2 IPU Bow. At the largest scale, 4096 TPU v4s are 1.67x as fast as 4216 Nvidia A100s for ResNet. At 256 chips, TPU v4 is ~4.5x as fast as the MK2 IPU Bow. The points are the reported results, and the dashed lines are interpolations for intermediate size systems. For TPU v4, the results for ≤2048 chips are from MLPerf Training 1.0; all the other points for all systems are from MLPerf Training 2.0.

Table 6: Mean power for DSA plus HBM for 64 chip systems running MLPerf. Adding the switches would show even higher efficiency for TPU v4 (Section 7.3). We use nvidia-smi reported power measurement on Azure Standard_ND96amsr_A100_v4 VMs during a rerun of Nvidia's MLPerf 2.0 64-chip submission. TPU v4 power measurements are done by running Google MLPerf 2.0 benchmark code at 64-chip scale in a Google data center. The TPU v4 mean power measurement is 2%–8% higher than in Table 4, but the workloads differ. MLPerf 3.0 may add power measurements to performance in the October 2023 round.

MLPerf Benchmark | A100  | TPU v4 | Ratio
BERT             | 380 W | 197 W  | 1.93
ResNet           | 273 W | 206 W  | 1.33

7 DISCUSSION
We comment next on 11 questions readers might have from our analysis of TPU v4 and the other DSAs.

7.1 Do peak FLOPS/second predict real performance?
Many in the ML community think peak FLOPS/second are a good performance proxy [45], but they are not. For example, TPU v4 is 4.3x–4.5x faster on two MLPerf benchmarks than IPU Bow on equal sized systems despite only having a 1.10x edge in peak FLOPS/second. Another example is that the A100 peak FLOPS/second rate is 1.13x TPU v4, but TPU v4 is 1.15x–1.67x faster for the same number of chips. Figure 16 gives the relationship between peak FLOPS/sec and memory bandwidth using the roofline model [61].

Figure 16: Roofline models for TPU v4, TPU v3, and A100 plus DNN models [61]. Operational intensity is in parentheses.

The A100 higher peak performance is for Boost Mode of up to 1410 MHz; the base clock is 1095 MHz. If the average rate was 1243 MHz, the peak performance of the A100 and TPU v4 would be equal (same ceiling in the roofline model). We measured the A100 running MLPerf BERT, and the average clock rate was 1280 MHz due to capping of power use by Nvidia GPU software.
Amdahl's Law reminds us that system balance—in compute, memory, and interconnect—with sufficient power and cooling to keep everything executing is still important. The integrated ICI network lets TPU supercomputers scale performance gracefully, and the OCSes let users tailor the topology to the application to improve performance.
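Section 7.1's roofline framing can be expressed in a few lines. The sketch below uses the peak FLOPS and HBM bandwidths from Tables 4 and 5, but the operational-intensity values are illustrative placeholders rather than the measured intensities plotted in Figure 16:

    # Minimal roofline model (Section 7.1): attainable TFLOPS/s is the lesser of the
    # compute ceiling and memory bandwidth times operational intensity (FLOPs/byte).
    CHIPS = {
        # peak bf16 TFLOPS, HBM bandwidth in GB/s (Tables 4 and 5)
        "TPU v4": (275, 1200),
        "TPU v3": (123, 900),
        "A100":   (312, 2039),
    }

    def attainable_tflops(chip: str, operational_intensity: float) -> float:
        peak_tflops, hbm_gbs = CHIPS[chip]
        memory_bound = hbm_gbs * operational_intensity / 1000.0   # GB/s * FLOP/B -> TFLOPS
        return min(peak_tflops, memory_bound)

    # A hypothetical low-intensity embedding-style kernel vs. a high-intensity matmul.
    for oi in (10, 1000):
        print(oi, {c: round(attainable_tflops(c, oi), 1) for c in CHIPS})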


7.2 How does OCS differ from NVLink and NVSwitch?

Every TPU since 2017 has had its own built-in router between links (ICI) to neighboring chips in the torus. Like NVLink, it enables "glueless" connection of TPUs, but at a larger scale than 4-8 GPUs: 256 TPU v2s and 1024 TPU v3s.
We think of optical circuit switching as the next generation of ICI versus a response to the latest NVSwitch, which uses electrical packet switching for 8 GPUs with one switch and up to 256 GPUs with two levels of switches.
OCSes are just fibers connected by mirrors, so any bandwidth running through a fiber can be switched between input and output fibers by the OCS across 4096 chips today (or even more in the future). For example, an OCS could handle multiple terabits/second per link by using wavelength multiplexing. Moreover, all inputs can be connected to all outputs, but the connections must be 1:1.
7.3 What if TPU v4 used IB versus OCS?

Let's start with Infiniband (IB) versus OCS switches. Just as NVLink connects 8 GPUs in a DGX, 8 TPUs would use ICI. We follow Nvidia's guidance by using a full 3-level fat tree for the hybrid IB/ICI network [41]. At an average of one NIC per GPU, a 1120 A100 superpod needs 164 Mellanox QM8790 40-port IB switches [41], each priced at ~$15k–$18k [10, 23]. The 1120 IB NICs are extra. To replace the 48 128-port OCSes, 4096 TPU v4s need 568 IB switches. An OCS is no more expensive per port than an IB switch, but it can support higher bandwidth because it is passively reflecting light encoded at the source. The hybrid IB/ICI option is substantially more expensive and harder for software.

Furthermore, active packet processing for an IB switch is far more power hungry than the tiny amount of power required to hold the MEMS mirrors to their configured orientation in an OCS.

ICI link bandwidth is 2x IB—400 vs 200 Gbit/s—but system speed is harder to judge. An internal event-driven simulator that operates at the TensorFlow graph operation level evaluated a hybrid ICI/IB network. (It ignores protocol processing on the CPU, which can be significant.) Depending on the slice size, an optimized all-reduce would run 1.8x–2.4x slower and an all-to-all would be 1.2x–2.4x slower. This network heterogeneity is also a software challenge. As communication is only a portion of training time, overall IB slowdown for a DNN might be as little as ~10%. However, the biggest impact is losing the benefits that originally inspired the use of OCS (Section 2): availability, scale, utilization, modularity, power efficiency, deployability, and so on.
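As a rough illustration of the cost side of this comparison, using only the figures quoted above rather than a full network design, the arithmetic looks like this:

    # Back-of-envelope cost of the hypothetical IB fat-tree alternative,
    # using only the numbers quoted in the text above.

    ib_switches = 568                 # 3-level fat tree to replace the 48 OCSes [41]
    price_range = (15_000, 18_000)    # ~$15k-$18k per Mellanox QM8790 [10, 23]
    nics = 4096                       # one NIC per chip, priced separately

    low, high = (ib_switches * p for p in price_range)
    print(f"IB switches alone: ${low/1e6:.1f}M to ${high/1e6:.1f}M, plus {nics} NICs and cables")

For comparison, Section 9 notes that the 48 OCSes and all underlying optical components are less than 5% of overall system cost.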
7.4 Nvidia announced the H100, the successor to A100, in 2022. Why not compare TPU v4 to it?

After systems are running production applications in the field, the Google tradition is to write retrospective, peer-reviewed papers for prominent conferences. The rationale is that the intellectual incentives and deadlines for a prestigious publication encourage architects working on the next generation to take the time to reflect and to make careful, detailed, apples-to-apples comparisons to the previous chip and contemporary alternatives that can pass peer review. The lessons learned improve future designs. The good news is that these retrospective, peer-reviewed, apples-to-apples papers are widely read, e.g., [25] has >4000 citations. If Google sold chips versus using them internally, we might instead need to publish unreviewed whitepapers much earlier in the chip lifecycle.

Speaking of apples-to-apples, both TPU v4s and A100s were deployed in 2020 and both use 7 nm technology. The newer, 700W H100s were not available at AWS, Azure, or Google Cloud when we did the research in 2022 or even when we submitted the final camera ready paper in 2023. The appropriate H100 match would be a successor to TPU v4 widely deployed in a similar time frame and technology (e.g., in 2023 and 4 nm).

7.5 Why 30%–90% more power for A100 (Table 6)?

It is hard to find a complete quantitative answer for these two complex designs. The 4x larger on-chip SRAM (160 MB versus 40 MB) allows memory transfers to DRAM to be in larger blocks, improving energy efficiency. Figure 13 above shows turning on the CMEM local memory, which increases on-chip SRAM from 32 MB to 160 MB, improves performance by 1.18x and performance/Watt by 1.24x.

The following three qualitative factors could explain the rest of the gap, but we can't say definitively without additional work. First, support for multithreading on the GPU leads to a 100x larger register file (27 MiB versus 0.25 MiB), which likely requires more energy for register accesses—even though the GPU uses a single port SRAM—as power generally increases with the square root of memory capacity [22]. Second, the 128x128 MXUs of TPU v4 mean each 128 entry input gets reused 128 times, whereas the 4x4 FP16 array multipliers of the A100 only get reused 4 times, leading to more on-chip SRAM accesses. Finally, the ~40% larger A100 chip may have longer data buses, increasing data transmission energy.
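The reuse argument can be made concrete with a little arithmetic: in a d×d systolic array each operand fetched from on-chip SRAM is reused about d times, so SRAM traffic per multiply-add scales roughly as 1/d. The sketch below is that simplification only, not a measured energy model:

    # Rough operand-reuse arithmetic (a simplification, not a measured energy model):
    # each input streamed through a d x d array is reused ~d times before being refetched.

    def sram_words_per_mac(d):
        return 2.0 / d   # two operands, each amortized over d reuses

    mxu_128 = sram_words_per_mac(128)   # TPU v4's 128x128 MXU
    fp16_4 = sram_words_per_mac(4)      # A100's 4x4 FP16 array multipliers

    print(f"~{fp16_4 / mxu_128:.0f}x more on-chip SRAM traffic per MAC for 4x4 arrays")  # ~32x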
7.6 What is the CO2e from TPU v4 vs other DSAs?

There is considerable concern about the carbon footprint of ML [42, 45]. Let's compare the cloud-only TPU v4 to a hypothetical recent DSA in an on-premise data center.

Practitioners can reduce operational energy use and CO2 emissions by optimizing the "4Ms" [42]:

1. Let's assume TPU v4 and another DSA are training the same models, so the Model parameter is 1.0 in this case.

2. The Machine parameter is measured in performance/Watt. TPU v4 is 2.7x TPU v3. The MLPerf power plan is in progress, so we estimate for others. TPU v4 is 1.2x–1.7x faster and 1.3x–1.9x lower power than the A100 and 4.3x–4.5x faster than the IPU, whose TDP is 300 W. TPU v4 chip performance/Watt is thus ~2x–6x versus a contemporary DSA; to be conservative, we assume 2x for this calculation.

3. The Mechanization parameter is data center power efficiency, measured as Power Usage Effectiveness (PUE). For on-premise data centers, PUE is often high. Computer architects improved PUE by helping advance the state-of-the-art of warehouse scale computers (WSCs) [2, 3, 16, 30, 44, 62]. For example, Google halved its average energy overhead from 21% (PUE = 1.21) in 2008 to 10% (PUE = 1.10) [20]. Worldwide average PUE fell from 2.50 in 2008 to 1.57 [52] as users closed their older data centers and switched to WSCs in the cloud [34]. The relative energy consumption is then 2 × 1.57 ÷ 1.10 or 2.85x more energy (kWh) on a contemporary DSA in an average on-premise data center versus TPU v4 in Google Cloud.

4. Map factors in the cleanliness of the energy supply, which varies considerably by location. WSCs can be placed anywhere, while on-premise data centers depend on the local grid. For example, the average portion of carbon free electrical energy (CFE) in 2021 for the US was 40%, but it was 88% for Google's Oklahoma data centers [19]. They exploit the plentiful local renewable energy [56]. Fortunately, Oklahoma hosts all TPU v4s in Google Cloud. The global average conversion factor from electricity to CO2 equivalent emissions—CO2e, including greenhouse gasses like methane—is 0.475 kg/kWh [24]. After acquiring renewable energy in Oklahoma, matched on an hourly basis with our energy consumption, it dropped to 0.074.

The estimated operational CO2e for a contemporary DSA in an average on-premise data center is then 2.85 × 0.475 ÷ 0.074 or ~18.3x higher than training on TPU v4 in Google Cloud. The CFE for data centers depends on the local grid plus the availability and acquisition of renewable energy. Given such a large impact—crucial for anyone building or using ML infrastructure—it's fortunate that Google has WSCs with very high CFE to house TPU v4 supercomputers.
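The whole estimate reduces to a few multiplications over the numbers above; the following sketch simply reproduces that arithmetic:

    # Reproducing the 4M arithmetic above with the numbers quoted in the text.

    model          = 1.0    # same model trained on both systems
    machine        = 2.0    # conservative perf/Watt edge of TPU v4 over a contemporary DSA
    pue_on_premise = 1.57   # worldwide average PUE [52]
    pue_google     = 1.10   # Google WSC PUE [20]
    kg_co2e_per_kwh_global   = 0.475   # global grid average [24]
    kg_co2e_per_kwh_oklahoma = 0.074   # Google's hourly-matched Oklahoma figure

    energy_ratio = model * machine * (pue_on_premise / pue_google)
    co2e_ratio = energy_ratio * (kg_co2e_per_kwh_global / kg_co2e_per_kwh_oklahoma)

    print(f"~{energy_ratio:.2f}x more energy on a contemporary DSA on-premise")   # ~2.85x
    print(f"~{co2e_ratio:.1f}x more operational CO2e than TPU v4 in Google Cloud")  # ~18.3x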
7.7 How fast do ML workloads change?

Table 1 above shows the rapid change for production workloads at Google. Note the drop in RNNs. Like RNNs, Transformers are popular for natural language translation and text summarization, but unlike RNNs they process the input all at once rather than sequentially. This revised model architecture means the operations can occur in parallel, which in turn means Transformer models can process much larger data sets. Two years after the Transformer paper was published [58], it was >20% of the TPU v3 workload. The Transformer models BERT and GPT-3 have also changed the workload landscape. Two years after the BERT paper was published [14], it was >25% of Google's workload on TPU v4, and it remained significant in 2022. Two years after publication of GPT-3 [6], LLMs were >30% of the TPU v4 production workloads. ML workloads can change dramatically in the two or more years it takes to design, build, and deploy a new ML supercomputer.

Figure 17: Change in size of DLRM0 over time measured in weights and embeddings. Each point is a new version of DLRM0 (43 total). Each weight is 1 byte and each embedding is 4 bytes.

7.8 Do individual DNN models also change?

Figure 17 shows the change in weights and embeddings for DLRM0 from 2017 to 2022. Weights grew 4.2x and embeddings grew 3.8x. Over those five years a new version was released every ~6 weeks (43 total). DLRM0 ran on all five TPU products over those five years. The velocity of change of the models and of the architecture highlights the importance of having a compiler that efficiently leverages the features of the underlying DSA.
7.9 Is MLPerf's DLRM benchmark (Figure 14 above) realistic?

Production DLRM workloads scale much better [64] than MLPerf DLRM for a few reasons. First, MLPerf DLRM has <2M FP32 weights while DLRM0 has 137M Int8 weights (see Figure 17). Second, the global batch size of MLPerf DLRM is capped at 64k for optimal model quality for this data set and optimizer, limiting batch size to 128 per SC on a 128-chip system (128 chips × 4 SCs/chip × 128 = 64k). In contrast, internal recommendation models often reach batch sizes of 2048–4096 and usefully scale up to 1024 chips (see Figure 11 above). Third, MLPerf DLRM has only 26 univalent features (compared to 100s in internal workloads) and no multivalent features. For these reasons, fixed overheads per batch such as HBM latency and CISC instruction generation time on the SC core sequencer are much higher on MLPerf DLRM than production workloads. Overheads like these limit its useful scalability to ≤128 chips for TPU v4 and A100.
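A quick sketch of the batch-size arithmetic above shows why the fixed per-batch overheads dominate: with the global batch capped at 64K, the work per SparseCore shrinks linearly with chip count.

    # Per-SparseCore batch under MLPerf DLRM's fixed 64K global batch
    # (4 SparseCores per chip, as stated above).

    GLOBAL_BATCH = 64 * 1024
    SC_PER_CHIP = 4

    for chips in (64, 128, 256, 512, 1024):
        per_sc = GLOBAL_BATCH // (chips * SC_PER_CHIP)
        print(f"{chips:5d} chips -> {per_sc:4d} examples per SparseCore per step")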
7.10 TPU v4 has less HBM capacity than A100; could that limit LLM performance?

Our autoML LLM configuration search (Section 4) considers HBM capacity, ICI-connected aggregated FLOPS/HBM bandwidth, and other model parameters (such as batch size). The HBM capacity could be a limiting factor in some cases, but typically TPU v4 enables larger models to be partitioned across more chips with effective compute-communication overlap [59] with little overhead (for example, two case studies in Table 3 above). However, given the higher HBM capacity but smaller NVLink-connected domain of Nvidia GPUs, Nvidia's best LLM configuration might be very different from the best one for TPU v4.
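A minimal sketch of the capacity side of that search follows, under a deliberately crude footprint model: parameters plus optimizer state only, with activations, sharding overheads, and the bandwidth terms the real search also weighs all omitted, and with byte counts that are assumptions rather than measured values.

    # Crude HBM-capacity feasibility check (assumed footprint: parameters + optimizer
    # state only; the byte counts below are illustrative assumptions).

    import math

    def min_chips_for_model(params_billion, bytes_per_param=2, optimizer_multiplier=3,
                            hbm_gib_per_chip=32):
        """Smallest chip count whose pooled HBM holds the sharded model state."""
        total_bytes = params_billion * 1e9 * bytes_per_param * optimizer_multiplier
        return math.ceil(total_bytes / (hbm_gib_per_chip * 2**30))

    # A 540B-parameter model needs on the order of a hundred chips' worth of HBM
    # before compute, bandwidth, or batch size even enter the search.
    print(min_chips_for_model(540))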
7.11 How can DSAs avoid overspecialization?

Building a robust DSA with a multi-generational lifetime is a balancing act between domain specialization and generality with flexibility. This harmony is especially important given ML is such a fast evolving field (see Sections 7.7 and 7.8). For example, had we devoted significant resources for specialized acceleration of RNNs, that effort would have been of little use when RNN popularity plummeted. As a positive example, by providing a flexible, general, and balanced design, TPU v4 proved to be an excellent match to LLMs that were popularized [Bro22] after it was deployed.

8 RELATED WORK

TPU v4 uses a dedicated 3D torus interconnect. Traditional supercomputers also employ tightly connected multiprocessors over 3D tori with a high-bandwidth interconnect [15, 50]. Nvidia GPU-based systems today use a two-tier network hierarchy with NVLink and NVSwitch among groups of 4 to 256 GPUs and Infiniband beyond that.

Twisted tori are not a recent invention. The ILLIAC-IV twisted one dimension of its wrap-around links for the 2D torus that it used [5]. Sequin introduced doubly-twisted tori for mapping binary trees onto processor arrays [47]. Camara, Moreto, et al. made the case for twisting 3D tori [7], and TPU v4 follows the k×2k×2k configuration from Camarero, Martinez, and Beivide [8].
Shalf et al. and Kamil et al. proposed a mix of circuit switching and packet switching for use in traditional supercomputers [49, 28] and Kamil et al. later suggested that a MEMS-based OCS could provide circuit switching [29]. A recent paper has a similar investigation plus topology and parallelization co-optimizations to accelerate ML [60]. For data centers, there have been many proposals for OCS-based networks [17, 31, 35, 51, 53, 57]. Some even include the concept of topology engineering. However, all of this related work consists of paper designs or proof-of-concept, testbed-scale demonstrations, in contrast to the widespread deployment of OCSes at Google for data center networks and for supercomputer interconnect. We believe that TPU v4 is the first commercial supercomputer built using OCS and the first supercomputer built with a reconfigurable interconnect that enhances performance.

The TPU v4 supercomputer implements a logically shared address space across physical chips. Software explicitly controls access and data movement; remote memories are available through asynchronous DMA writes only. The Cray T3E enabled a similar logically shared address space, with bulk asynchronous reads and writes, load-store access to remote memories, and a rich suite of atomic operations [46]. The TPU v4 memory system is tailored for high performance, with each chip maintaining tens of thousands of outstanding memory requests (to both local and remote memories).
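To make the contrast with the T3E concrete, here is an illustrative-only model of the programming discipline described above: a software-managed global address space where remote memory is touched exclusively by asynchronous bulk DMA writes that complete at an explicit fence. The class and function names are invented for this sketch and are not the TPU runtime API.

    # Illustrative only: software-managed global address space with remote access
    # restricted to asynchronous bulk DMA writes (names invented for this sketch).

    class Chip:
        def __init__(self, chip_id, hbm_words):
            self.chip_id = chip_id
            self.hbm = [0] * hbm_words

    class Interconnect:
        def __init__(self, chips):
            self.chips = chips
            self.in_flight = []   # outstanding writes; they land by the next fence

        def dma_write_async(self, dst_chip, dst_offset, payload):
            # Returns immediately; software tracks completion via fence(), not loads.
            self.in_flight.append((dst_chip, dst_offset, list(payload)))

        def fence(self):
            for chip, offset, payload in self.in_flight:
                self.chips[chip].hbm[offset:offset + len(payload)] = payload
            self.in_flight.clear()

    chips = [Chip(i, hbm_words=16) for i in range(4)]
    net = Interconnect(chips)
    net.dma_write_async(dst_chip=3, dst_offset=0, payload=[1, 2, 3])
    net.fence()
    print(chips[3].hbm[:4])   # [1, 2, 3, 0]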
Acceleration of embeddings is key for DLRMs used for business-critical applications, and we believe Google was the first to include on-ASIC hardware support in TPU v2, deployed in 2017. Neo from Facebook (Meta) [37] trains embedding tables with up to 12T parameters using a 128-GPU system. Neo also exploits table, row, column and data parallelism; overlaps communication and compute; and improves kernel fusion. Nvidia's MLPerf entries use similar techniques, e.g., custom fused kernels for reduction over Infiniband. Two recent papers present other optimizations for embeddings [18, 48].

9 SUMMARY

Two major architectural features of TPU v4 have small cost but outsized advantages. The SparseCore accelerates embeddings of DLRM models by 5x–7x by providing a dataflow sea-of-cores architecture that allows embeddings to be placed anywhere in the 128 TiB physical memory of the TPU v4 supercomputer. This gain comes at the cost of only ~5% in die area and power.

The OCSes and underlying optical components at the heart of the TPU v4 supercomputer are relatively inexpensive at <5% of overall costs and <3% of overall power consumption, yet they provide a remarkable set of eight benefits:

1. Scalability.
2. Improved availability, which enables the TPU v4 supercomputer to be 4x larger than TPU v3.
3. Modularity, allowing the faster 3D torus topology from 64 to 3072 chips and novel shapes like twisted tori.
4. Higher performance, as users can pick the topology that is best for their application.
5. Diminished power, as MEMS optical circuit switching is more energy efficient than electronic packet switching.
6. Simplified scheduling to improve utilization.
7. Faster deployment, for better return on investment.
8. Enhanced security, which encourages different organizations to share use of TPU v4 supercomputers.

Moreover, replacing OCS and ICI with Infiniband increases costs, raises power consumption, and degrades performance.

TPU v4 is faster and lower power than contemporary DSA chips made using similar technologies deployed close to the same time and for similar sized systems. The power edge might be even larger if the interconnects are included.

Training time of LLMs is greatly reduced over TPU v3 by using 3K TPU v4 slices with their 3D torus topology. The performance, scalability, and availability make TPU v4 supercomputers the workhorses of large language models (LLMs) like LaMDA, MUM, and PaLM [54, 38, 9]. These features allowed the 540B parameter PaLM model to sustain a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers [9].

Google has deployed dozens of TPU v4 supercomputers, including eight for external use via Google Cloud. Moreover, the large size of the TPU v4 supercomputer and its reliance on OCSes look prescient given that the design began two years before the paper was published that has stoked the enthusiasm for LLMs [6].

Advances by computer architects in the state-of-the-art of warehouse scale computers (WSCs) save energy and thus help reduce the carbon footprint of ML. When energy-efficient TPU v4 supercomputers are housed inside energy-efficient WSCs that rely on ~90% carbon free electricity, they can consume only ~⅙–½ of the energy and produce only ~5% of the operational CO2e from training on a typical contemporary ML DSA in the average on-premise data center. A ~20x reduction in carbon footprint greatly increases the chances of delivering on the amazing potential of ML in a sustainable manner [42].

ACKNOWLEDGEMENTS

Special thanks go to the Google Platforms Optics team that kept advancing the state-of-the-art of OCSes and optical transceivers after other organizations gave up. We also thank John Hennessy, Robert Hundt, Christos Kozyrakis, Sridhar Lakshmanamurthy, Jae W. Lee, Hong Liu, Aamer Mahmood, Partha Ranganathan, Adrian Sampson, Amin Vahdat, Amir Yazdanbakhsh, George Yuan, and the anonymous ISCA reviewers for their feedback and suggestions on this paper.

We finally wish to thank our collaborators who helped with the measurements for this paper—Ben Albrecht, Jinliang Wei, Shibo Wang, Yuechao Pan, and Ziqiang Feng for Table 3 and Lluis-Miquel Munguia, Hao Wu, and Yuechao Pan for Table 6—and the many engineers who developed the TPU v4 chip, hardware, software, and the many teams contributing to its deployment, including but not limited to: Zach Cotton, Pedram Dashti, Bill Edwards, Hema Hariharan, Chetan Kale, Georgios Konstadinidis, Alan Kulawik, Justin Lee, Hong Liu, Erji Mao, Omkar Pathak, Erick Tuttle, Daoyi Wang, Kevin Yasumura, and Sara Zebian.

REFERENCES

[1] Anil, R., Gadanho, S., Huang, D., Jacob, N., Li, Z., Lin, D., Phillips, T., Pop, C., Regan, K., Shamir, G.I. and Shivanna, R., 2022. On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models. 16th ACM Conference on Recommender Systems.
[2] Barroso, L.A., and Hölzle, U., 2009, First Edition. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 6(3), pp. 1-120.
[3] Barroso, L.A., Hölzle, U. and Ranganathan, P., 2018. The datacenter as a computer: Designing warehouse-scale machines, Third Edition. Synthesis Lectures on Computer Architecture, 13(3), pp. i-189.
[4] Bloomberg, October 26, 2015. Google turning its lucrative web search over to AI machines.
[5] Bouknight, W.J., Denenberg, S.A., McIntyre, D.E., Randall, J.M., Sameh, A.H. and Slotnick, D.L., 1972. The ILLIAC IV system. Proceedings of the IEEE, 60(4), pp. 369-388.
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[7] Camara, J.M., Moreto, M., Vallejo, E., Beivide, R., Miguel-Alonso, J., Martínez, C. and Navaridas, J., 2010. Twisted torus topologies for enhanced interconnection networks. IEEE Transactions on Parallel and Distributed Systems, 21(12), pp. 1765-1778.
[8] Camarero, C., Martinez, C. and Beivide, R., 2014. Lattice graphs for high-scale interconnection topologies. IEEE Transactions on Parallel and Distributed Systems, 26(9), pp. 2506-2519.
[9] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S. and Schuh, P., 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[10] Colamco, 2022. Mellanox QM8790 - Quantum HDR Switch, https://www.colamco.com/product/mellanox-infiniband-switch-mqm8790-hs2f-1503410?utm_source=froogle&utm_medium=referral.
[11] Covington, P., Adams, J. and Sargin, E., 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191-198.
[12] Dally, W.J. and Towles, B.P., 2004. Principles and practices of interconnection networks. Elsevier.
[13] Deepmind, Nov 18, 2019. Advanced machine learning helps Play Store users discover personalised apps.
[14] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[15] Dungworth, M., Harrell, J., Levine, M., Nelson, S., Oberlin, S. and Reinhardt, S.P., 2011. CRAY T3E.
[16] Fan, X., Weber, W.D. and Barroso, L.A., 2007. Power provisioning for a warehouse-sized computer. In 2007 ACM/IEEE 34th International Symposium on Computer Architecture (ISCA), pp. 13-23.
[17] Farrington, N., G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. In Proceedings of ACM SIGCOMM, Aug. 2010.
[18] Ghaemmaghami, B., Ozdal, M., Komuravelli, R., Korchev, D., Mudigere, D., Nair, K. and Naumov, M., 2022. Learning to Collide: Recommendation System Model Compression with Learned Hash Functions. arXiv preprint arXiv:2203.15837.
[19] Google. Tracking our carbon free energy progress, https://sustainability.google/progress/energy/.
[20] Google. Google Data Center Efficiency, https://www.google.com/about/datacenters/efficiency/.
[21] Graphcore, 2022. IPU-POD64 Reference Design Datasheet, docs.graphcore.ai/projects/ipu-pod64-datasheet/en/latest/.
[22] Horowitz, M., 2014. Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10-14.
[23] Insight, 2022. Mellanox Quantum QM8790 - switch - 40 ports, https://www.insight.com/en_US/shop/product/MQM8790-HS2F/Mellanox/MQM8790-HS2F/MellanoxQuantumQM8790-swit/.
[24] International Energy Agency, Global Energy & CO2 Status Report 2019, Report Extract Emissions.
[25] Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T., Gottipati, R., Gulland, W., Hagmann, R., Ho, C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D.H., 2017. In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12.
[26] Jouppi, N.P., Yoon, D.H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C. and Patterson, D., 2020. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7), pp. 67-78.
[27] Jouppi, N.P., Yoon, D.H., Ashcraft, M., Gottscho, M., Jablin, T.B., Kurian, G., Laudon, J., Li, S., Ma, P., Ma, X., Norrie, T., Patil, N., Prasad, S., Young, C., Zhou, Z., and Patterson, D., 2021. Ten lessons from three generations shaped Google's TPUv4i. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 1-14.
[28] Kamil, S., Pinar, A., Gunter, D., Lijewski, M., Oliker, L. and Shalf, J., 2007, May. Reconfigurable hybrid interconnection for static and dynamic scientific applications. In Proceedings of the 4th Int'l Conference on Computing Frontiers, pp. 183-194.
[29] Kamil, S., Oliker, L., Pinar, A. and Shalf, J., 2009. Communication requirements and interconnect optimization for high-end scientific applications. IEEE Transactions on Parallel and Distributed Systems, 21(2), pp. 188-202.
[30] Karandikar, S., Mao, H., Kim, D., Biancolin, D., Amid, A., Lee, D., Pemberton, N., Amaro, E., Schmidt, C., Chopra, A. and Huang, Q., 2018, June. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 29-42.
[31] Khani, M., Ghobadi, M., Alizadeh, M., Zhu, Z., Glick, M., Bergman, K., Vahdat, A., Klenk, B. and Ebrahimi, E., 2021, August. SiP-ML: high-bandwidth optical network interconnects for machine learning training. In Proceedings of the 2021 ACM SIGCOMM Conference, pp. 657-675.
[32] Li, S., Tan, M., Pang, R., Li, A., Cheng, L., Le, Q.V. and Jouppi, N.P., 2021. Searching for Fast Model Families on Datacenter Accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8085-8095.
[33] Li, S., Andersen, G., Chen, T., Cheng, L., Grady, J., Huang, D., Le, Q., Li, A., Li, X., Li, Y., Liang, C., Lu, Y., Ni, Y., Pang, F., Ranganathan, P., Tan, M., Wicke, M., Wu, G., Zhu, S., and Jouppi, N., 2023. Hyperscale Hardware Optimized Neural Architecture Search. In Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[34] Masanet, E., Shehabi, A., Lei, N., Smith, S. and Koomey, J., 2020. Recalibrating global data center energy-use estimates. Science, 367(6481), pp. 984-986.
[35] Minkenberg, C., Rodriguez, G., Prisacari, B., Schares, L., Heidelberger, P., Chen, D. and Stunkel, C., 2016, March. Performance benefits of optical circuit switches for large-scale dragonfly networks. In 2016 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1-3. IEEE.
[36] MLCommons, v2.0 Results, June 29, 2022, https://mlcommons.org/en/training-normal-20/.
[37] Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J. and Luo, L., 2022, June. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), pp. 993-1011.
[38] Nayak, P., 2021. MUM: A new AI milestone for understanding information. Google, May 18.
[39] Norrie, T., Patil, N., Yoon, D.H., Kurian, G., Li, S., Laudon, J., Young, C., Jouppi, N. and Patterson, D., 2021. The design process for Google's training chips: TPUv2 and TPUv3. IEEE Micro, 41(2), pp. 56-63.
[40] Nvidia, 2020. Nvidia A100 Tensor Core GPU Architecture.
[41] Nvidia, 2021. Nvidia DGX SuperPOD: Scalable Infrastructure for AI Leadership Reference Architecture.
[42] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M. and Dean, J., 2022. The carbon footprint of machine learning training will plateau, then shrink. IEEE Computer, 55(7), pp. 18-28.
[43] Poutievski, L., Mashayekhi, O., Ong, J., Singh, A., Tariq, M., Wang, R., Zhang, J., Beauregard, V., Conner, P., Gribble, S. and Kapoor, R., 2022. Jupiter evolving: transforming Google's datacenter network via optical circuit switches and software-defined networking. In Proceedings of the ACM SIGCOMM 2022 Conference, pp. 66-85.
[44] Raghavendra, R., Ranganathan, P., Talwar, V., Wang, Z. and Zhu, X., 2008, March. No "power" struggles: coordinated multi-level power management for the data center. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 48-59.
[45] Schwartz, R., Dodge, J., Smith, N.A. and Etzioni, O., 2020. Green AI. Communications of the ACM, 63(12), pp. 54-63.
[46] Scott, S.L., 1996. Synchronization and communication in the T3E multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 26-36.
[47] Sequin, C.H., 1981. Doubly twisted torus networks for VLSI processor arrays. In Proceedings of the 8th Annual International Symposium on Computer Architecture (ISCA), pp. 471-480.
[48] Sethi, G., Bhattacharya, P., Choudhary, D., Wu, C.-J. and Kozyrakis, C., 2022. FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models. arXiv preprint arXiv:2301.02959.
[49] Shalf, J., Kamil, S., Oliker, L. and Skinner, D., 2005, November. Analyzing ultra-scale application communication requirements for a reconfigurable hybrid interconnect. In SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, pp. 17-17.
[50] Shaw, D.E., Deneroff, M.M., Dror, R.O., Kuskin, J.S., Larson, R.H., Salmon, J.K., Young, C., Batson, B., Bowers, K.J., Chao, J.C. and Eastwood, M.P., 2008. Anton, a special-purpose machine for molecular dynamics simulation. Communications of the ACM, 51(7), pp. 91-99.
[51] Singla, A., Singh, A., Ramachandran, K., Xu, L. and Zhang, Y. Proteus: A topology malleable data center network. In Proceedings of ACM HotNets, Oct. 2010.
[52] Taylor, P., 2022. Data center average annual power usage effectiveness (PUE) worldwide 2007-2022, Nov 22.
[53] Teh, M.Y., Zhao, S., Cao, P. and Bergman, K. Couder: Robust topology engineering for optical circuit switched data center networks. arXiv preprint arXiv:2010.00090, Sept. 2020.
[54] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.T., Jin, A., Bos, T., Baker, L., Du, Y. and Li, Y., 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
[55] Urata, R., Liu, H., Yasumura, K., Mao, E., Berger, J., Zhou, X., Lam, C., Bannon, R., Hutchinson, D., Nelson, D. and Poutievski, L., 2022. Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale. arXiv preprint arXiv:2208.10041.
[56] US Energy Information Agency, 2022. Oklahoma State Profile and Energy Estimates, https://www.eia.gov/state/?sid=OK#:~:text=In%202021%2C%20wind%20supplied%2041,electricity%20net%20generation%20from%20wind.
[57] Vahdat, A., Liu, H., Zhao, X. and Johnson, C., 2011. The emerging optical data center. In Proceedings of the Optical Fiber Communications Conference.
[58] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30 (NeurIPS 2017).
[59] Wang, S., Wei, J., Sabne, A., David, A., Llbeyi, B., Hechtman, B., Chen, D., Murthy, K.S., Maggioni, M., Zhang, Q., Kumar, S., Guo, T., Xu, Y. and Zhou, Z., 2023. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. In Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[60] Wang, W., Khazraee, M., Zhong, Z., Ghobadi, M., Jia, Z., Mudigere, D., Zhang, Y. and Kewitsch, A., 2022. TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs. arXiv preprint arXiv:2202.00433.
[61] Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), pp. 65-76.
[62] Wu, Q., Deng, Q., Ganesh, L., Hsu, C.H., Jin, Y., Kumar, S., Li, B., Meza, J. and Song, Y.J., 2016. Dynamo: Facebook's Data Center-Wide Power Management System. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, pp. 469-480.
[63] Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M. and Pang, R., 2021. GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663.
[64] Zhao, M., Agarwal, N., Basant, A., Gedik, B., Pan, S., Ozdal, M., Komuravelli, R., Pan, J., Bao, T., Lu, H. and Narayanan, S., 2022, June. Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), pp. 1042-1057.