TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson
Google, Mountain View, CA
{jouppi,gkurian,lsheng,pcma,rahulnagarajan,lnai,nishantpatil,suvinay,aswing,btowles,cliffy,zhoux,zongweiz}@google.com and [email protected]
ABSTRACT
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance. Much cheaper, lower power, and faster than Infiniband, the OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips, and the added scale plus OCS flexibility help large language models (LLMs), recommendation models (DLRMs), and Transformers such as BERT; an LLM trained on TPU v4 sustained ~60% of peak FLOPS/second [6, 38, 54]. For similar sized systems, TPU v4 is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2-6x less energy and produce ~20x less operational CO2e than contemporary DSAs in typical on-premise data centers.

ACM Reference Format:
Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA '23), June 17-21, 2023, Orlando, FL, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3579371.3589350
*This paper is part of the Industry Track of ISCA 2023's program.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
ISCA '23, June 17-21, 2023, Orlando, FL, USA
© 2023 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0095-8/23/06.
https://doi.org/10.1145/3579371.3589350
that have not yet been described. The major contributions of the paper are:
● It describes and evaluates the first production deployment of OCSes in a supercomputer and the first to allow topology reconfiguration to improve performance.
● It describes and evaluates the first accelerator support for embeddings in a commercial ML system.
● It documents the rapid change in production model types since 2016 for the fast changing ML field (Table 1).
● It shows how Google uses ML to co-optimize DNN models, OCS topology, and the SparseCore.
The next section introduces OCSes and explains their many benefits. Section 3 motivates the SparseCore and shows its performance gains. Section 4 uses ML to search how to co-optimize the hardware and DNN models. The next two sections compare performance on production workloads versus TPU v3 and then versus the Nvidia A100 and the Graphcore MK2 IPU using MLPerf. The paper ends with a discussion, a related work section, and a summary.

suggesting 4×4×4 (64 chips) or 8×8×8 (512). With 4 TPU v4s per CPU host, 64 TPU v4 chips and their 16 CPU hosts comfortably fit into one rack. As 512 chips need multiple racks, a 4³ building block was chosen.

2.2 Construction of the TPU v4 Supercomputer
Figure 1 shows the links from the 6 "faces" of a 4³ block. There are 16 links per face, totaling 96 optical links per block that connect to OCSes. To provide the wraparound links of a 3D torus, the links on the opposing sides must connect to the same OCS. Thus, each 4³ block connects to 6 × 16 ÷ 2 = 48 OCSes. The Palomar OCS is 136×136 (128 ports plus 8 spares for link testing and repairs), so 48 OCSes connect the 48 pairs of cables from 64 4³ blocks (each 64 chips), yielding the desired total of 4096 TPU v4 chips.
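The counts in this construction follow from simple arithmetic. A minimal sketch that reproduces them, using only the quantities quoted above (the constant names below are ours, not Google's):

```python
# Sketch of the link/OCS arithmetic of Section 2.2, using only the numbers
# quoted in the text: 4x4x4 blocks, 16 optical links per face, and a Palomar
# OCS with 128 usable ports (136 minus 8 spares). Illustrative only.

FACES_PER_BLOCK = 6          # a 4x4x4 cube has 6 faces
LINKS_PER_FACE = 4 * 4       # 16 optical links leave each face
OCS_USABLE_PORTS = 128       # 136 ports minus 8 spares for testing/repair

links_per_block = FACES_PER_BLOCK * LINKS_PER_FACE            # 96 optical links
# Opposing faces must land on the same OCS to close the 3D-torus wraparound,
# so each block occupies one port pair on 6 * 16 / 2 = 48 distinct OCSes.
ocses_needed = FACES_PER_BLOCK * LINKS_PER_FACE // 2          # 48 OCSes

# Each block uses one input/output port pair per OCS, so 128 usable ports
# serve 64 blocks, and 64 blocks of 64 chips give the full machine.
blocks_per_ocs = OCS_USABLE_PORTS // 2                        # 64 blocks
total_chips = blocks_per_ocs * 4 * 4 * 4                      # 4096 chips

print(links_per_block, ocses_needed, blocks_per_ocs, total_chips)
# -> 96 48 64 4096
```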
Table 1: Share of Google production TPU workloads by DNN model type over time, measured across ~90% of TPUs [25, 26, 27]. Columns are labeled by TPU generation, deployment date, and whether the workload is inference or training.

DNN Model    | TPU v1 7/2016 (Inference) | TPU v3 4/2019 (Training & Inference) | TPU v4 Lite 2/2020 (Inference) | TPU v4 10/2022 (Training)
MLP/DLRM     | 61%                       | 27%                                  | 25%                            | 24%
RNN          | 29%                       | 21%                                  | 29%                            | 2%
CNN          | 5%                        | 24%                                  | 18%                            | 12%
Transformer  | --                        | 21%                                  | 28%                            | 57%
  (BERT)     | --                        | --                                   | (28%)                          | (26%)
  (LLM)      | --                        | --                                   | --                             | (31%)
Figure 1: Optical links from the six faces of a 4×4×4 block to the OCSes.
optical-to-electrical converter at the fiber connector of the destination tray. The 48 OCSes join eight rows together to form the complete 64-rack system.

reasonable slice goodput. OCSes also have fair goodput for 99.0% and 99.5% for most slice sizes. Figure 4 assumes all slice size requests are equal, but workloads have many sizes (Table 2).
Figure 2: The TPU v4 ASIC and tray, with HBM, PCIe host connections, and OSFP connectors for the ICI links.

Figure 3: The 4096-chip, 64-rack TPU v4 supercomputer and its OCS interconnect.

Figure 4: Slice goodput versus slice size at different CPU host availabilities.

Table 2: TPU v4 slice shapes as a percentage of the workload in November 2022; shapes with <0.1% usage are omitted. Twisted tori require n×n×2n or n×2n×2n geometries; shapes are listed as x×y×z with x ≤ y ≤ z.

Chips <64
  Regular Tori: 1×1×1 (1) 2.1%, 1×1×2 (2) 0.4%, 1×2×2 (4) 6.7%, 2×2×2 (8) 4.7%, 2×2×4 (16) 6.4%, 2×4×4 (32) 8.9%
  Total: 29%
Chips 64
  Regular Tori: 4×4×4 (64) 13.9%
  Total: 14%
Chips 128-192
  Twisted Tori: 4×4×8_T (128) 16.0%
  Twistable, not twisted Tori: 4×4×8_NT (128) 1.5%
  Regular Tori: 4×4×12 (192) 0.7%
  Total: 18%
Chips 256-384
  Twisted Tori: 4×8×8_T (256) 9.2%
  Twistable, not twisted Tori: 4×8×8_NT (256) 1.5%
  Regular Tori: 4×4×16 (256) 1.0%, 4×8×12 (384) 0.1%
  Total: 12%
Chips 512-768
  Regular Tori: 8×8×8 (512) 9.6%, 4×8×16 (512) 1.7%, 4×4×32 (512) 0.6%, 8×8×12 (768) 0.7%
  Total: 13%
Chips 1024-1536
  Twisted Tori: 8×8×16_T (1K) 1.8%
  Twistable, not twisted Tori: 8×8×16_NT (1K) 1.4%
  Regular Tori: 4×16×16 (1K) 0.3%, 4×4×64 (1K) 0.1%, 4×8×32 (1K) 0.1%, 8×12×16 (1.5K) 0.1%, 4×4×96 (1.5K) 0.1%, 8×8×24 (1.5K) 0.1%
  Total: 4%
Chips 2048-3072
  Twisted Tori: 8×16×16_T (2K) 1.4%
  Twistable, not twisted Tori: 8×16×16_NT (2K) 0.3%
  Regular Tori: 12×16×16 (3K) 5.7%, 4×4×192 (3K) 0.4%
  Total: 8%
2.5 OCS Scheduling Benefits
The OCS also simplifies scheduling, which increases utilization. For TPU v3, a 256 chip slice meant the scheduler had to find 256 contiguous chips that were idle. For TPU v4, it can pick four 4³ blocks from anywhere in the supercomputer. Slices don't even need to be a power of 2; they can be 4i×4j×4k, where 0 < i ≤ j ≤ k. For example, a user could request a 192 TPU v4 slice with a geometry of 4×4×12.

2.6 OCS Modularity and Security Benefits
Since the OCS can switch circuits in milliseconds, TPU v4 can easily change topology to match the application, the number of nodes, and the system that runs those jobs. TPU v4 provides wraparound links of a 3D Torus for most slice sizes, which doubles both the bisection bandwidth and the bandwidth of important collective communication operations (e.g., all-reduce) versus the mesh-like alternative [12], yet still allowing the TPU v4 to scale interconnect bandwidth up to 16³ (4096) chips. OCS also enables an air gapped network isolation between different slices, which enhances the security of multiple customers sharing a TPU v4 supercomputer.

different numbers of TPUs along each dimension. Alternatively, [7] proposes a topology that outperforms rectangular tori with lower latency and higher bisection bandwidth without increasing switch hardware. The twisted torus rewires some links between 4³ cubes to reduce the worst case latency. Figure 5 shows a regular topology and a twisted topology. Since TPU v4 uses an OCS to connect 4³ blocks, the "rewiring" is mostly reprogramming of routing in the OCS. Using all-to-all communication with large messages as a microbenchmark, Figure 6 below shows the performance gain from the twisted topology. The twisted torus improves all-to-all throughput by 1.63x and 1.31x over the regular torus on 4×4×8 and 4×8×8 slices, respectively. While the popular interconnect topologies today are Clos networks and Dragonfly networks, the twisting option reduces the worst case bisection bandwidth for the 3D tori and makes them more attractive for today's supercomputers.
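A minimal sketch combining the slice-geometry rule of Section 2.5 (4i×4j×4k blocks of 4³ cubes) with Table 2's twistability condition (n×n×2n or n×2n×2n). It enumerates shapes only up to 16 chips per dimension for brevity and illustrates the constraints rather than the production scheduler:

```python
# Enumerate slice geometries allowed by Section 2.5 (4i x 4j x 4k with
# 0 < i <= j <= k) and mark the ones that can be twisted per Table 2
# (n x n x 2n or n x 2n x 2n). Dimensions are capped at 16 here for brevity;
# Table 2 shows longer, thinner slices (e.g. 4x4x64) are also used.
from itertools import combinations_with_replacement

MAX_BLOCKS_PER_DIM = 4    # up to 16 chips per dimension in this sketch

def twistable(x: int, y: int, z: int) -> bool:
    n = x
    return (x, y, z) in {(n, n, 2 * n), (n, 2 * n, 2 * n)}

shapes = []
for i, j, k in combinations_with_replacement(range(1, MAX_BLOCKS_PER_DIM + 1), 3):
    x, y, z = 4 * i, 4 * j, 4 * k
    shapes.append((x, y, z, x * y * z, twistable(x, y, z)))

for x, y, z, chips, tw in sorted(shapes, key=lambda s: s[3]):
    print(f"{x}x{y}x{z:<3} {chips:>5} chips  twistable={tw}")
# e.g. 4x4x8 (128), 4x8x8 (256), and 8x8x16 (1K) are twistable, matching the
# _T entries in Table 2, while 4x4x12 (192) is a valid slice that cannot twist.
```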
3 SPARSECORE
v4). In contrast to all-reduces of large parameter tensors in dense training, the all-to-all transfers of smaller embedding vectors use HBM and ICI with finer-grained access patterns for scatter/gather. As separate cores, SCs allow parallelization across dense compute, SC, and ICI communications. Figure 7 shows the SC block diagram, which we consider a "dataflow" architecture because data flows from memory to a variety of directly connected specialized compute units.

The most general SC units are the 16 compute tiles (dark blue boxes in Figure 7). Each tile has an associated HBM channel and supports multiple outstanding memory accesses. Each tile has a Fetch Unit, a programmable 8-wide SIMD Vector Processing Unit (scVPU, not to be confused with VPU of the TC in TPU v4), and a Flush Unit. The Fetch Unit reads activations and parameters from the HBM into the tile's slice of a 2.5 MiB Sparse Vector Memory (Spmem). The scVPU uses the same ALUs as TC's VPU. The Flush Unit writes updated parameters to HBM during the backward pass. In addition, the five Cross-Channel Units (gold boxes in Figure 7) perform specific embedding operations, which their names explain. Like TPU v1, the units execute CISC-like instructions and operate on variable-length inputs, where the run-time of each instruction is data-dependent. The cross-channel units operate across all 16 banks of Spmem collectively.

3.6 SparseCore Performance
The end-to-end embedding lookup performance is essentially proportional to the bisection bandwidth due to the all-to-all transfers of small embedding vectors. For the 2D torus used in TPU v2 and TPU v3, this bandwidth scales as N^(1/2) for N chips. The 3D torus in TPU v4 scales as N^(2/3) [12]. Figure 8 shows that the TPU v4 to TPU v3 bisection bandwidth ratio is 2x-4x at a given chip count and accelerates embeddings by 1.1x-2.0x. At 1024 chips, SC overheads start to dominate, so bisection bandwidth is less important.

Figure 9 below shows performance of an internal production recommendation model (DLRM0, see Sections 7.8 and 7.9) across the two TPU generations for 128 chips. The standalone CPU configuration has 576 Skylake sockets (400 for learners and 176 for variable servers). The bottom two bars show TPU v4 without SC, where the embeddings are placed in CPU memory. The "Emb on CPU" bar places embeddings in CPU host memory and the "Emb on Variable Server" bar places embeddings on 64 external variable servers. TPU v3 is faster than CPUs by 9.8x. TPU v4 beats TPU v3 by 3.1x and CPUs by 30.1x. When embeddings are placed in CPU memory for TPU v4, performance drops by 5x-7x, with bottlenecks due to CPU memory bandwidth.
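The N^(1/2) versus N^(2/3) scaling above can be made concrete with a short sketch; the constants are illustrative and only the exponents come from the torus geometry:

```python
# Sketch of why embedding all-to-alls favor TPU v4's 3D torus: bisection
# bandwidth grows as N^(2/3) for a cubic 3D torus versus N^(1/2) for a square
# 2D torus (Section 3.6, [12]). Per-link bandwidth differences are ignored
# here; only the exponents matter for the trend.

def bisection_links_2d(n_chips: int) -> float:
    side = n_chips ** 0.5
    return 2 * side          # a cut crosses `side` links, doubled by wraparound

def bisection_links_3d(n_chips: int) -> float:
    side = n_chips ** (1 / 3)
    return 2 * side * side   # a cut plane of side*side links, doubled by wraparound

for n in (64, 256, 1024, 4096):
    ratio = bisection_links_3d(n) / bisection_links_2d(n)
    print(f"{n:>5} chips: 3D/2D bisection ratio ~ {ratio:.1f}x")
# The ratio grows as N^(1/6): about 2x at 64 chips and 4x at 4096 chips, the
# same 2x-4x range reported for TPU v4 over TPU v3 in Figure 8.
```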
embedding layers (running on SC) and hidden layers (running on TC) for DLRM0, which approaches perfect SC-TC load-balance (lower blue and red bars in Figure 10) and improves DLRM0 end-to-end performance by >10%. This performance uplift is equivalent to improvements historically achieved by a team of >10 experts over about half a year, further demonstrating PA-NAS's capability of increasing accelerator flexibility and performance gain.

We can also use search to tailor the TPU v4 topology to the DNN Model. Table 3 shows gains in performance from searching for configurations for two models. The first example increased performance 2.3x by changing the geometry for a 512 TPU v4 slice over the novice user's initial design for an LLM. The second example shows a harder task, demonstrating an improvement of 1.2x over an expert's design for the pre-training phase of GPT-3.

Table 3: Improvements in performance as we vary the topology of a 512 TPU v4 slice for training an LLM and the GPT-3 pre-training stage. The original LLM used model parallelism of dimensions 16×32 and no pipeline or data parallelism for a 4×8×16 topology. The revision changed model parallelism to 64×8 and the topology to 8×8×8. The "1D/2D activation/weight partitioning" option is a typical way of partitioning tensors in a large model graph (see Figure 7 in [63]). For GPT-3, the original row used a 8×8×8 topology, pipeline parallelism of depth 8, no data parallelism, and model parallelism of dimensions 8×8. The revision changed the topology to 4×8×16, pipeline depth to 16, data parallelism to 4, and model parallelism parameters to 1×8.

Case               | Versions      | Throughput (seqs/sec) | Hyper-Parameters (topology, partition spec [pipeline, data, model1, model2], 1D/2D activation/weight partitioning)
LLM                | Novice's pick | 17.9 (1.0x)           | 4×8×16, [1, 1, 16, 32], 2D/2D
LLM                | Best perf.    | 41.3 (2.3x)           | 8×8×8, [1, 1, 64, 8], 1D/2D
GPT-3 Pre-training | Expert's pick | 21.0 (1.0x)           | 8×8×8, [8, 1, 8, 8], 2D/2D
GPT-3 Pre-training | Best perf.    | 25.0 (1.2x)           | 4×8×16, [16, 4, 1, 8], 1D/1D

5 PRODUCTION WORKLOAD PERFORMANCE
Table 1 above shows the workload mix for 2022 plus the history of how it has changed over time; Section 7.7 discusses the workload changes. We use 8 applications to capture the production workload to compare TPU v3 to TPU v4. Figure 11 shows how efficiently eight production workloads scale up on TPU v4. We are encouraged that half of the workloads (CNN0, RNN0, RNN1, and BERT1) scale well to 3K chips. (Figure 4 above shows that given CPU hosts with availability of 99.0% to 99.9%, in practice it is much easier to schedule a 3K slice than a 4K slice.) For the remainder, our production teams are aware of where the scaling limits are and are constructing solutions, but those solutions are not yet implemented to allow measuring performance at full 3K scale.

Table 4: Key features of TPU v4 and TPU v3.

Feature                   | TPU v4                                           | TPU v3
Production deployment     | 2020                                             | 2018
Peak TFLOPS               | 275 (bf16 or int8)                               | 123 (bf16)
Clock Rate                | 1050 MHz                                         | 940 MHz
Tech. node, Die size      | 7 nm, <600 mm²                                   | 16 nm, <700 mm²
Transistor count          | 22 billion                                       | 10 billion
Chips per CPU host        | 4                                                | 8
TDP                       | N.A.                                             | N.A.
Idle, min/mean/max power  | 90, 121/170/192 W                                | 123, 175/220/262 W
Inter Chip Interconnect   | 6 links @ 50 GB/s                                | 4 links @ 70 GB/s
Largest scale config.     | 4096 chips                                       | 1024 chips
Processor Style           | Single Instruction 2D Data                       | Single Instruction 2D Data
Processors / Chip         | 2                                                | 2
Threads / Core            | 1                                                | 1
SparseCores / Chip        | 4                                                | 2
On Chip Memory            | 128 MiB (CMEM) + 32 MiB (VMEM) + 10 MiB (spMEM)  | 32 MiB (VMEM) + 5 MiB (spMEM)
Register File Size        | 0.25 MiB                                         | 0.25 MiB
HBM2 capacity, BW         | 32 GiB, 1200 GB/s                                | 32 GiB, 900 GB/s
Table 4 compares key features of TPU v3 and TPU v4. Manufactured in 7 nm instead of 16 nm, TPU v4 has twice the matrix multipliers (enabled by the increased process density) and an 11% faster clock—this drives the 2.2X gain in peak performance. About 40% of the performance/Watt improvement was from technology and the rest was from design improvements (e.g., balancing the pipeline, implementing clock gating). The HBM memory bandwidth is 1.3x higher. Depending on the slice size, the bisection bandwidth of TPU v4 is 2x-4x (see Figure 8 above). It also has the 128 MB on-chip CMEM scratchpad memory not found in TPU v3.

Figure 12 shows how much faster TPU v4 supercomputers are than TPU v3 supercomputers at the same slice size. Given the comparisons in Table 4, it's not surprising that at the same slice size most applications run 1.5x-2.0x faster on TPU v4 than on TPU v3. DLRM0 is 3.0-3.5x faster and DLRM1 is 2.8x at 512 chips as TPU v4 has twice as many SCs and their clock rate is faster. The surprise is RNN1; it runs 3.3x faster on TPU v4. RNN1's small weights and small batch size benefit significantly from CMEM bandwidth versus HBM.

Figure 12: Speedup of TPU v4 vs v3 for the same slice sizes.

Figure 13 shows results with CMEM turned off on TPU v4; it contributes to 1.2x performance gain overall but 2x for RNN1. It also shows that TPU v4 has 2.1x the performance and 2.7x the performance/Watt of TPU v3; as mentioned above, ~40% of the gain was from the technology and the rest from design.

LLM training will become a benchmark in a future MLPerf release. We omit performance of internal LLMs—31% of the workload in Table 1—on TPU v3 because it is unoptimized. TPU v3's 2D fixed topology hinders high-performance model partitioning needed for LLMs. We also lack TPU v3 chip capacity to train large LLMs within reasonable time-to-convergence SLOs given their lower FLOPS/second and suboptimal model partitioning performance.

Table 5: Features of the two DSAs [21, 40] reporting MLPerf 2.0 Training results besides TPU v4. The A100 has 32×108 = 3456 threads and the IPU has 6×1472 = 8832 threads.

Feature                           | Nvidia A100                         | Graphcore MK2 IPU
Production deployment             | 2020                                | 2021
Peak TFLOPS                       | 312 (bf16), 624 (i8)                | 250 (bf16)
Clock Rate Base/Boost             | 1095/1410 MHz                       | 1850 MHz
Tech. node, Die size              | 7 nm, 826 mm²                       | 7 nm, 832 mm²
Transistor count                  | 54 billion                          | 59 billion
Chips per CPU host                | 4                                   | 4
TDP                               | 400 W                               | 300 W
Inter Chip Interconnect           | 12 links @ 25 GB/s                  | 3 links @ 64 GB/s
Largest scale MLPerf 2.0 config.  | 4216 chips                          | 256 chips
Processor Style                   | Single Instruction Multiple Threads | Multiple Instruction Multiple Data
Processors / Chip                 | 108                                 | 1472
Threads / Core                    | 32                                  | 6
On Chip Memory                    | 40 MiB                              | 900 MiB
Register File Size                | 27 MiB                              | 1.40 MiB
HBM2 capacity, BW                 | 80 GiB, 2039 GB/s                   | 0
6 MLPERF
interpolations based on the number of chips. The published MLPerf results for TPU v4 and A100 both scale to much larger systems than the IPU (4096 vs 256 chips). For similar sized systems, TPU v4 is 1.15x faster for BERT than the A100 and ~4.3x faster than the IPU. For ResNet, TPU v4 is 1.67x and ~4.5x faster, respectively.

Table 6 shows the power we measured running MLPerf benchmarks; A100s use on average 1.3x-1.9x more power.

Figure 14: Reported MLPerf Training 2.0 highest performance [36] relative to A100. Each column labels the number of chips per system. Graphcore submitted results for BERT and ResNet. TPU v4 DLRM is in the research category. The MLPerf DLRM is not representative of production DLRMs [64] (see Section 7.9).

Figure 15: Reported MLPerf Training 2.0 performance for BERT (top) and ResNet (bottom) [36] relative to an 8-Way A100 GPU on a log-log scale. To include IPUs, we only show BERT and ResNet in this figure. At the largest scale of 4096 chips, TPU v4 is 1.15x as fast as the Nvidia A100 for BERT. At 256 chips, the maximum IPU size in MLPerf, TPU v4 is ~4.3x as fast as the MK2 IPU Bow. At the largest scale, 4096 TPU v4s are 1.67x as fast as 4216 Nvidia A100s for ResNet. At 256 chips, TPU v4 is ~4.5x as fast as the MK2 IPU Bow. The points are the reported results, and the dashed lines are interpolations for intermediate-sized systems. For TPU v4, the results for ≤2048 chips are from MLPerf Training 1.0; all the other points for all systems are from MLPerf Training 2.0.

Table 6: Mean power for DSA plus HBM for 64 chip systems running MLPerf. Adding the switches would show even higher efficiency for TPU v4 (Section 7.3). We use nvidia-smi reported power measurement on Azure Standard_ND96amsr_A100_v4 VMs during a rerun of Nvidia's MLPerf 2.0 64-chip submission. TPU v4 power measurements are done by running Google MLPerf benchmark code on 64-chip scale in Google data center. The TPU v4 mean power measurement is 2%-8% higher than in Table 4, but the workloads differ. MLPerf 3.0 may add power measurements to performance in the October 2023 round.

MLPerf Benchmark | A100  | TPU v4 | Ratio
BERT             | 380 W | 197 W  | 1.93
ResNet           | 273 W | 206 W  | 1.33

7 DISCUSSION
We comment next on 11 questions readers might have from our analysis of TPU v4 and the other DSAs.

7.1 Do peak FLOPS/second predict real performance?
Many in the ML community think peak FLOPS/second are a good performance proxy [45], but they are not. For example, TPU v4 is 4.3x-4.5x faster on two MLPerf benchmarks than IPU Bow on equal sized systems despite only having a 1.10x edge in peak FLOPS/second. Another example is that the A100 peak FLOPS/second rate is 1.13x TPU v4, but TPU v4 is 1.15x-1.67x faster for the same number of chips. Figure 16 gives the relationship between peak FLOPS/sec and memory bandwidth using the roofline model [61].

Figure 16: Roofline models for TPU v4, TPU v3, and A100 plus DNN models [61]. Operational intensity is in parentheses.

The A100 higher peak performance is for Boost Mode of up to 1410 MHz; the base clock is 1095 MHz. If the average rate was 1243 MHz, the peak performance of the A100 and TPU v4 would be equal (same ceiling in the roofline model). We measured the A100 running MLPerf BERT, and the average clock rate was 1280 MHz due to capping of power use by Nvidia GPU software.

Amdahl's Law reminds us that system balance—in compute, memory, and interconnect—with sufficient power and cooling to keep everything executing is still important. The integrated ICI network lets TPU supercomputers scale performance gracefully and the OCSes let users tailor the topology to the application to improve performance.
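Figure 16's curves follow the standard roofline formula, attainable FLOPS/s = min(peak FLOPS/s, operational intensity × memory bandwidth) [61]. A minimal sketch using the peak compute and HBM bandwidth figures from Tables 4 and 5; the operational intensities below are hypothetical, not the paper's measured models:

```python
# Roofline sketch (Figure 16, [61]): throughput is capped either by peak
# compute or by memory bandwidth times operational intensity (FLOPs per byte
# of HBM traffic). Peak numbers come from Tables 4 and 5 of this paper; the
# example intensities are illustrative only.

CHIPS = {
    # name: (peak TFLOPS bf16, HBM bandwidth in GB/s)
    "TPU v4": (275, 1200),
    "TPU v3": (123, 900),
    "A100":   (312, 2039),
}

def attainable_tflops(peak_tflops: float, bw_gbs: float, oi_flops_per_byte: float) -> float:
    return min(peak_tflops, bw_gbs * oi_flops_per_byte / 1000.0)

for name, (peak, bw) in CHIPS.items():
    ridge = peak * 1000.0 / bw   # intensity where the chip becomes compute-bound
    print(f"{name}: ridge point ~ {ridge:.0f} FLOPs/byte")
    for oi in (10, 100, 500):    # hypothetical operational intensities
        print(f"  OI={oi:>3}: {attainable_tflops(peak, bw, oi):.0f} TFLOPS")
# In this simple model, memory-bound intensities favor the higher-bandwidth
# A100, while compute-bound ones hinge on the sustained clock; Section 7.1's
# point is that neither peak FLOPS nor this chart alone predicts delivered
# application performance.
```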
7.2 How does OCS differ from NVLink and NVSwitch?
Every TPU since 2017 has had its own built-in router between links (ICI) to neighboring chips in the torus. Like NVlink, it enables "glueless" connection of TPUs, but at a larger scale than 4-8 GPUs: 256 TPU v2s and 1024 TPU v3s.

We think of optical circuit switching as the next generation of ICI versus a response to the latest NVSwitch, which uses electrical packet switching for 8 GPUs with one switch and up to 256 GPUs with two levels of switches.

OCSes are just fibers connected by mirrors, so any bandwidth running through a fiber can be switched between input and output fibers by the OCS across 4096 chips today (or even more in the future). For example, an OCS could handle multiple terabits/second per link by using wavelength multiplexing. Moreover, all inputs can be connected to all outputs, but the connections must be 1:1.

7.3 What if TPU v4 used IB versus OCS?
Let's start with Infiniband (IB) versus OCS switches. Just as NVLink connects 8 GPUs in a DGX, 8 TPUs would use ICI. We follow Nvidia's guidance by using a full 3-level fat tree for the hybrid IB/ICI network [41]. At an average of one NIC per GPU, a 1120 A100 superpod needs 164 Mellanox QM8790 40-port IB switches [41], each priced at ~$15k-$18k [10, 23]. The 1120 IB NICs are extra. To replace the 48 128-port OCSes, 4096 TPU v4s need 568 IB switches. An OCS is no more expensive per port than an IB switch, but it can support higher bandwidth because it is passively reflecting light encoded at the source. The hybrid IB/ICI option is substantially more expensive and harder for software.

Furthermore, active packet processing for an IB switch is far more power hungry than the tiny amount of power required to hold the MEMS mirrors to their configured orientation in an OCS.

ICI link bandwidth is 2x IB—400 vs 200 Gbit/s—but system speed is harder to judge. An internal event-driven simulator that operates at the TensorFlow graph operation level evaluated a hybrid ICI/IB network. (It ignores protocol processing on the CPU, which can be significant.) Depending on the slice size, an optimized all-reduce would run 1.8x-2.4x slower and an all-to-all would be 1.2x-2.4x slower. This network heterogeneity is also a software challenge. As communication is only a portion of training time, overall IB slowdown for a DNN might be as little as ~10%.

However, the biggest impact is losing the benefits that originally inspired the use of OCS (Section 2): availability, scale, utilization, modularity, power efficiency, deployability, and so on.

7.4 Nvidia announced the H100, the successor to A100, in 2022. Why not compare TPU v4 to it?
After systems are running production applications in the field, the Google tradition is to write retrospective, peer-reviewed papers for prominent conferences. The rationale is that the intellectual incentives and deadlines for a prestigious publication encourage architects working on the next generation to take the time to reflect and to make careful, detailed, apples-to-apples comparisons to the previous chip and contemporary alternatives that can pass peer review. The lessons learned improve future designs. The good news is that these retrospective, peer-reviewed, apples-to-apples papers are widely read, e.g., [25] has >4000 citations. If Google sold chips versus using them internally, we might instead need to publish unreviewed whitepapers much earlier in the chip lifecycle.

Speaking of apples-to-apples, both TPU v4s and A100s were deployed in 2020 and both use 7 nm technology. The newer, 700W H100s were not available at AWS, Azure, or Google Cloud when we did the research in 2022 or even when we submitted the final camera ready paper in 2023. The appropriate H100 match would be a successor to TPU v4 widely deployed in a similar time frame and technology (e.g., in 2023 and 4 nm).

7.5 Why 30%-90% more power for A100 (Table 6)?
It is hard to find a complete quantitative answer for these two complex designs. The 4x larger on-chip SRAM (160 MB versus 40 MB) allows memory transfers to DRAM to be in larger blocks, improving energy efficiency. Figure 13 above shows turning on the CMEM local memory, which increases on-chip SRAM from 32 MB to 160 MB, improves performance by 1.18x and performance/Watt by 1.24x.

The following three qualitative factors could explain the rest of the gap, but we can't say definitively without additional work. Support for multithreading on the GPU leads to a 100x larger register file (27 MiB versus 0.25 MiB), which likely requires more energy for register accesses—even though the GPU uses a single port SRAM—as power generally increases with the square root of memory capacity [22]. Second, the 128x128 MXUs of TPU v4 mean each 128 entry input gets reused 128 times, whereas the 4x4 FP16 array multipliers of the A100 only get reused 4 times, leading to more on-chip SRAM accesses. Finally, the ~40% larger A100 chip may have longer data buses, increasing data transmission energy.

7.6 What is the CO2e from TPU v4 vs other DSAs?
There is considerable concern about the carbon footprint of ML [42, 45]. Let's compare the cloud-only TPU v4 to a hypothetical recent DSA in an on-premise data center.

Practitioners can reduce operational energy use and CO2 emissions by optimizing the "4Ms" [42]:
1. Let's assume TPU v4 and another DSA are training the same models, so the Model parameter is 1.0 in this case.
2. The Machine parameter is measured in performance/Watt. TPU v4 is 2.7x TPU v3. The MLPerf power plan is in progress, so we estimate for others. TPU v4 is 1.2x-1.7x faster and 1.3x-1.9x lower power than the A100 and 4.3x-4.5x faster than the IPU, whose TDP is 300 W. TPU v4 chip performance/Watt is thus ~2x-6x versus a contemporary DSA; to be conservative, we assume 2x for this calculation.
3. The Mechanization parameter is data center power efficiency, measured as Power Usage Effectiveness (PUE). For on-premise data centers, PUE is often high. Computer architects improved PUE by helping advance the state-of-the-art of warehouse scale computers (WSCs) [2, 3, 16, 30, 44, 62]. For example, Google halved its average energy overhead from 21% (PUE = 1.21) in 2008 to 10% (PUE = 1.10) [20]. Worldwide average PUE fell from 2.50 in 2008 to 1.57 [52] as users closed their older data centers and switched to WSCs in the cloud [34].
The relative energy consumption is then 2 × 1.57 ÷ 1.10 or 2.85x more energy (kWh) on a contemporary DSA in an average on-premise data center versus TPU v4 in Google Cloud.
4. Map factors in the cleanliness of the energy supply, which varies considerably by location. WSCs can be placed anywhere, while on-premise data centers depend on the local grid. For example, the average portion of carbon free electrical energy
10
ISCA ‘23, June 17–21, 2023, Orlando, FL, USA Jouppi, et al.
10
TPU v4: An Optically Reconfigurable Supercomputer for ML with Hardware Support for Embeddings
Industrial Product ISCA ‘23, June 17–21, 2023, Orlando, FL, USA
(CFE) in 2021 for the US was 40%, but it was 88% for Google's Oklahoma data centers [19]. They exploit the plentiful local renewable energy [56]. Fortunately, Oklahoma hosts all TPU v4s in Google Cloud. The global average conversion factor from electricity to CO2 equivalent emissions—CO2e, including greenhouse gasses like methane—is 0.475 kg/kWh [24]. After acquiring renewable energy in Oklahoma, matched on an hourly basis with our energy consumption, it dropped to 0.074.

The estimated operational CO2e for a contemporary DSA in an average on-premise data center is then 2.85 × 0.475 ÷ 0.074 or ~18.3x higher than training on TPU v4 in Google Cloud. The CFE for data centers depends on the local grid plus the availability and acquisition of renewable energy. Given such a large impact—crucial for anyone building or using ML infrastructure—it's fortunate that Google has WSCs with very high CFE to house TPU v4 supercomputers.

7.7 How fast do ML workloads change?
Table 1 above shows the rapid change for production workloads at Google. Note the drop in RNNs. Like RNNs, Transformers are popular for natural language translation and text summarization, but unlike RNNs they process the input all at once rather than sequentially. This revised model architecture means the operations can occur in parallel, which in turn means Transformer models can process much larger data sets. Two years after the Transformer paper was published [58], it was >20% of the TPU v3 workload. The Transformer models BERT and GPT-3 have also changed the workload landscape. Two years after the BERT paper was published [14], it was >25% of Google's workload on TPU v4, and it remained significant in 2022. Two years after publication of GPT-3 [6], LLMs were >30% of the TPU v4 production workloads. ML workloads can change dramatically in the two or more years it takes to design, build, and deploy a new ML supercomputer.

Figure 17: Change in size of DLRM0 over time measured in weights and embeddings. Each point is a new version of DLRM0 (43 total). Each weight is 1 byte and each embedding is 4 bytes.

7.8 Do individual DNN models also change?
Figure 17 shows the change in weights and embeddings for DLRM0 from 2017 to 2022. Weights grew 4.2x and embeddings grew 3.8x. Over those five years a new version was released every ~6 weeks (43 total). DLRM0 ran on all five TPU products over those five years. The velocity of change of the models and of the architecture highlights the importance of having a compiler that efficiently leverages the features of the underlying DSA.

7.9 Is MLPerf's DLRM benchmark (Figure 14 above) realistic?
Production DLRM workloads scale much better [64] than MLPerf DLRM for a few reasons. MLPerf DLRM has <2M FP32 weights while DLRM0 has 137M Int8 weights (see Figure 17). Second, the global batch size of MLPerf DLRM is capped at 64k for optimal model quality for this data set and optimizer, limiting batch size to 128 per SC on a 128-chip system (128 chips × 4 SCs/chip × 128 = 64k). In contrast, internal recommendation models often reach batch sizes of 2048-4096 and usefully scale up to 1024 chips (see Figure 11 above). Third, MLPerf DLRM has only 26 univalent features (compared to 100s in internal workloads) and no multivalent features. For these reasons, fixed overheads per batch such as HBM latency and CISC instruction generation time on the SC core sequencer are much higher on MLPerf DLRM than production workloads. Overheads like these limit its useful scalability to ≤128 chips for TPU v4 and A100.

7.10 TPU v4 has less HBM capacity than A100; could that limit LLM performance?
Our autoML LLM configuration search (Section 4) considers HBM capacity, ICI-connected aggregated FLOPS/HBM bandwidth, and other model parameters (such as batch size). The HBM capacity could be a limiting factor in some cases, but typically TPU v4 enables larger models to be partitioned across more chips with effective compute-communication overlap [59] with little overhead (for example, two case studies in Table 3 above). However, given the higher HBM capacity but smaller NVlink-connected domain of Nvidia GPUs, Nvidia's best LLM configuration might be very different from the best one for TPU v4.

8 RELATED WORK
TPU v4 uses a dedicated 3D torus interconnect. Traditional supercomputers also employ tightly connected multiprocessors over 3D tori with a high-bandwidth interconnect [15, 50]. Nvidia GPU-based systems today use a two-tier network hierarchy with NVLink and NVSwitch among groups of 4 to 256 GPUs and Infiniband beyond that.

Twisted tori are not a recent invention. The ILLIAC-IV twisted one dimension of its wrap-around links for the 2D torus that it used [5]. Sequin introduced doubly-twisted tori for mapping binary trees onto processor arrays [47]. Camara, Moreto, et al.
made the case for twisting 3D tori [7], and TPU v4 follows the k×2k×2k configuration from Camarero, Martinez, and Beivide [8].

Shalf et al. and Kamil et al. proposed a mix of circuit switching and packet switching for use in traditional supercomputers [49, 28] and Kamil et al. later suggested that a MEMS-based OCS could provide circuit switching [29]. A recent paper has a similar investigation plus topology and parallelization co-optimizations to accelerate ML [60]. For data centers, there have been many proposals for OCS-based networks [17, 31, 35, 51, 53, 57]. Some even include the concept of topology engineering. However, all of this related work consists of paper designs or proof-of-concept, testbed-scale demonstrations, in contrast to the widespread deployment of OCSes at Google for data center networks and for supercomputer interconnect. We believe that TPU v4 is the first commercial supercomputer built using OCS and the first supercomputer built with a reconfigurable interconnect that enhances performance.

The TPU v4 supercomputer implements a logically shared address space across physical chips. Software explicitly controls access and data movement; remote memories are available through asynchronous DMA writes only. The Cray T3E enabled a similar logically shared address space, with bulk asynchronous reads and writes, load-store access to remote memories, and a rich suite of atomic operations [46]. The TPU v4 memory system is tailored for high performance, with each chip maintaining tens of thousands of outstanding memory requests (to both local and remote memories).

Acceleration of embeddings is key for DLRMs used for business-critical applications, and we believe Google was the first to include on-ASIC hardware support in TPU v2, deployed in 2017. Neo from Facebook (Meta) [37] trains embedding tables with up to 12T parameters using a 128-GPU system. Neo also exploits table, row, column and data parallelism; overlaps communication and compute; and improves kernel fusion. Nvidia's MLPerf entries use similar techniques, e.g., custom fused kernels for reduction over Infiniband. Two recent papers present other optimizations for embeddings [18, 48].

9 SUMMARY
Two major architectural features of TPU v4 have small cost but outsized advantages. The SparseCore accelerates embeddings of DLRM models by 5x-7x by providing a dataflow sea-of-cores architecture that allows embeddings to be placed anywhere in the 128 TiB physical memory of the TPU v4 supercomputer. This gain comes at the cost of only ~5% in die area and power.

The OCSes and underlying optical components at the heart of the TPU v4 supercomputer are relatively inexpensive at <5% of overall costs and <3% of overall power consumption, yet they provide a remarkable set of eight benefits:
1. Scalability.
2. Improved availability, which enables the TPU v4 supercomputer to be 4x larger than TPU v3.
3. Modularity, allowing the faster 3D torus topology from 64 to 3072 chips and novel shapes like twisted tori.
4. Higher performance, as users can pick the topology that is best for their application.
5. Diminished power, as MEMS optical circuit switching is more energy efficient than electronic packet switching.
6. Simplified scheduling to improve utilization.
7. Faster deployment, for better return on investment.
8. Enhanced security, which encourages different organizations to share use of TPU v4 supercomputers.

Moreover, replacing OCS and ICI with Infiniband increases costs, raises power consumption, and degrades performance.

TPU v4 is faster and lower power than contemporary DSA chips made using similar technologies deployed close to the same time and for similar sized systems. The power edge might be even larger if the interconnects are included.

Training time of LLMs is greatly reduced over TPU v3 by using 3K TPU v4 slices with their 3D torus topology. The performance, scalability, and availability make TPU v4 supercomputers the workhorses of large language models (LLMs) like LaMDA, MUM, and PaLM [54, 38, 9]. These features allowed the 540B parameter PaLM model to sustain a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers [9].

Google has deployed dozens of TPU v4 supercomputers, including eight for external use via Google Cloud. Moreover, the large size of the TPU v4 supercomputer and its reliance on OCSes looks prescient given that the design began two years before the paper was published that has stoked the enthusiasm for LLMs [6].

Advances by computer architects in the state-of-the-art of warehouse scale computing (WSCs) save energy and thus help reduce the carbon footprint of ML. When energy-efficient TPU v4 supercomputers are housed inside energy-efficient WSCs that rely on ~90% carbon free electricity, they can consume only ~⅙-½ of the energy and produce only ~5% of the operational CO2e from training on a typical contemporary ML DSA in the average on-premise data center. A ~20x reduction in carbon footprint greatly increases the chances of delivering on the amazing potential of ML in a sustainable manner [42].

ACKNOWLEDGEMENTS
Special thanks go to the Google Platforms Optics team that kept advancing the state-of-the-art of OCSes and optical transceivers after other organizations gave up. We also thank John Hennessy, Robert Hundt, Christos Kozyrakis, Sridhar Lakshmanamurthy, Jae W. Lee, Hong Liu, Aamer Mahmood, Partha Ranganathan, Adrian Sampson, Amin Vahdat, Amir Yazdanbakhsh, George Yuan, and the anonymous ISCA reviewers for their feedback and suggestions on this paper.

We finally wish to thank our collaborators who helped with the measurements for this paper—Ben Albrecht, Jinliang Wei, Shibo Wang, Yuechao Pan, and Ziqiang Feng for Table 3 and Lluis-Miquel Munguia, Hao Wu, and Yuechao Pan for Table 6—and the many engineers who developed the TPU v4 chip, hardware, software, and the many teams contributing to its deployment, including but not limited to: Zach Cotton, Pedram Dashti, Bill Edwards, Hema Hariharan, Chetan Kale, Georgios Konstadinidis, Alan Kulawik, Justin Lee, Hong Liu, Erji Mao, Omkar Pathak, Erick Tuttle, Daoyi Wang, Kevin Yasumura, and Sara Zebian.

REFERENCES
[1] Anil, R., Gadanho, S., Huang, D., Jacob, N., Li, Z., Lin, D., Phillips, T., Pop, C., Regan, K., Shamir, G.I. and Shivanna, R., 2022. On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models. 16th ACM Conference on Recommender Systems.
[2] Barroso, L.A. and Hölzle, U., 2009, First Edition. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 6(3), pp. 1-120.
[3] Barroso, L.A., Hölzle, U. and Ranganathan, P., 2018. The datacenter as a computer: Designing warehouse-scale machines, Third Edition. Synthesis Lectures on Computer Architecture, 13(3), pp. i-189.
[4] Bloomberg, October 26, 2015. Google turning its lucrative web search over to AI machines.
[5] Bouknight, W.J., Denenberg, S.A., McIntyre, D.E., Randall, J.M., Sameh, A.H. and Slotnick, D.L., 1972. The ILLIAC IV system. Proceedings of the IEEE, 60(4), pp. 369-388.
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[7] Camara, J.M., Moreto, M., Vallejo, E., Beivide, R., Miguel-Alonso, J., Martínez, C. and Navaridas, J., 2010. Twisted torus topologies for enhanced interconnection networks. IEEE Transactions on Parallel and Distributed Systems, 21(12), pp. 1765-1778.
[8] Camarero, C., Martinez, C. and Beivide, R., 2014. Lattice graphs for high-scale interconnection topologies. IEEE Transactions on Parallel and Distributed Systems, 26(9), pp. 2506-2519.
[9] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S. and Schuh, P., 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[10] Colamco, 2022. Mellanox QM8790 - Quantum HDR Switch, https://www.colamco.com/product/mellanox-infiniband-switch-mqm8790-hs2f-1503410?utm_source=froogle&utm_medium=referral.
[11] Covington, P., Adams, J. and Sargin, E., 2016. Deep neural networks for Youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191-198.
[12] Dally, W.J. and Towles, B.P., 2004. Principles and practices of interconnection networks. Elsevier.
[13] Deepmind, Nov 18, 2019. Advanced machine learning helps Play Store users discover personalised apps.
[14] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[15] Dungworth, M., Harrell, J., Levine, M., Nelson, S., Oberlin, S. and Reinhardt, S.P., 2011. CRAY T3E.
[16] Fan, X., Weber, W.D. and Barroso, L.A., 2007. Power provisioning for a warehouse-sized computer. In 2007 ACM/IEEE 34th International Symposium on Computer Architecture (ISCA), pp. 13-23.
[17] Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V., Fainman, Y., Papen, G. and Vahdat, A. Helios: A hybrid electrical/optical switch architecture for modular data centers. In Proceedings of ACM SIGCOMM, Aug. 2010.
[18] Ghaemmaghami, B., Ozdal, M., Komuravelli, R., Korchev, D., Mudigere, D., Nair, K. and Naumov, M., 2022. Learning to Collide: Recommendation System Model Compression with Learned Hash Functions. arXiv preprint arXiv:2203.15837.
[19] Google. Tracking our carbon free energy progress, https://sustainability.google/progress/energy/.
[20] Google. Google Data Center Efficiency, https://www.google.com/about/datacenters/efficiency/.
[21] Graphcore, 2022. IPU-POD64 Reference Design Datasheet, docs.graphcore.ai/projects/ipu-pod64-datasheet/en/latest/.
[22] Horowitz, M., 2014. Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10-14.
[23] Insight, 2022. Mellanox Quantum QM8790 - switch - 40 ports, https://www.insight.com/en_US/shop/product/MQM8790-HS2F/Mellanox/MQM8790-HS2F/MellanoxQuantumQM8790-swit/.
[24] International Energy Agency, Global Energy & CO2 Status Report 2019, Report Extract Emissions.
[25] Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T., Gottipati, R., Gulland, W., Hagmann, R., Ho, C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E. and Yoon, D.H., 2017. In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12.
[26] Jouppi, N.P., Yoon, D.H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C. and Patterson, D., 2020. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7), pp. 67-78.
[27] Jouppi, N.P., Yoon, D.H., Ashcraft, M., Gottscho, M., Jablin, T.B., Kurian, G., Laudon, J., Li, S., Ma, P., Ma, X., Norrie, T., Patil, N., Prasad, S., Young, C., Zhou, Z. and Patterson, D., 2021. Ten lessons from three generations shaped Google's TPUv4i. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 1-14.
[28] Kamil, S., Pinar, A., Gunter, D., Lijewski, M., Oliker, L. and Shalf, J., 2007, May. Reconfigurable hybrid interconnection for static and dynamic scientific applications. In Proceedings of the 4th Int'l Conference on Computing Frontiers, pp. 183-194.
[29] Kamil, S., Oliker, L., Pinar, A. and Shalf, J., 2009. Communication requirements and interconnect optimization for high-end scientific applications. IEEE Transactions on Parallel and Distributed Systems, 21(2), pp. 188-202.
[30] Karandikar, S., Mao, H., Kim, D., Biancolin, D., Amid, A., Lee, D., Pemberton, N., Amaro, E., Schmidt, C., Chopra, A. and Huang, Q., 2018, June. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 29-42.
[31] Khani, M., Ghobadi, M., Alizadeh, M., Zhu, Z., Glick, M., Bergman, K., Vahdat, A., Klenk, B. and Ebrahimi, E., 2021, August. SiP-ML: high-bandwidth optical network interconnects for machine learning training. In Proceedings of the 2021 ACM SIGCOMM Conference, pp. 657-675.
[32] Li, S., Tan, M., Pang, R., Li, A., Cheng, L., Le, Q.V. and Jouppi, N.P., 2021. Searching for Fast Model Families on Datacenter Accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8085-8095.
[33] Li, S., Andersen, G., Chen, T., Cheng, L., Grady, J., Huang, D., Le, Q., Li, A., Li, X., Li, Y., Liang, C., Lu, Y., Ni, Y., Pang, F., Ranganathan, P., Tan, M., Wicke, M., Wu, G., Zhu, S. and Jouppi, N., 2023. Hyperscale Hardware Optimized Neural Architecture Search. In Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[34] Masanet, E., Shehabi, A., Lei, N., Smith, S. and Koomey, J., 2020. Recalibrating global data center energy-use estimates. Science, 367(6481), pp. 984-986.
[35] Minkenberg, C., Rodriguez, G., Prisacari, B., Schares, L., Heidelberger, P., Chen, D. and Stunkel, C., 2016, March. Performance benefits of optical circuit switches for large-scale dragonfly networks. In 2016 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1-3. IEEE.
[36] MLCommons, MLPerf Training v2.0 Results, June 29, 2022, https://mlcommons.org/en/training-normal-20/.
[37] Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J. and Luo, L., 2022, June. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), pp. 993-1011.
[38] Nayak, P., 2021. MUM: A new AI milestone for understanding information. Google, May 18.
[39] Norrie, T., Patil, N., Yoon, D.H., Kurian, G., Li, S., Laudon, J., Young, C., Jouppi, N. and Patterson, D., 2021. The design process for Google's training chips: TPUv2 and TPUv3. IEEE Micro, 41(2), pp. 56-63.
[40] Nvidia, 2020. Nvidia A100 Tensor Core GPU Architecture.
[41] Nvidia, 2021. Nvidia DGX SuperPOD: Scalable Infrastructure for AI Leadership, Reference Architecture.
[42] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M. and Dean, J., 2022. The carbon footprint of machine learning training will plateau, then shrink. IEEE Computer, 55(7), pp. 18-28.
[43] Poutievski, L., Mashayekhi, O., Ong, J., Singh, A., Tariq, M., Wang, R., Zhang, J., Beauregard, V., Conner, P., Gribble, S. and Kapoor, R., 2022. Jupiter evolving: transforming Google's datacenter network via optical circuit switches and software-defined networking. In Proceedings of the ACM SIGCOMM 2022 Conference, pp. 66-85.
[44] Raghavendra, R., Ranganathan, P., Talwar, V., Wang, Z. and Zhu, X., 2008, March. No "power" struggles: coordinated multi-level power management for the data center. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 48-59.
[45] Schwartz, R., Dodge, J., Smith, N.A. and Etzioni, O., 2020. Green AI. Communications of the ACM, 63(12), pp. 54-63.
[46] Scott, S.L., 1996. Synchronization and communication in the T3E multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 26-36.
[47] Sequin, C.H., 1981. Doubly twisted torus networks for VLSI processor arrays. In Proceedings of the 8th Annual International Symposium on Computer Architecture (ISCA), pp. 471-480.
[48] Sethi, G., Bhattacharya, P., Choudhary, D., Wu, C.-J. and Kozyrakis, C., 2022. FlexShard: Flexible sharding for industry-scale sequence recommendation models. arXiv preprint arXiv:2301.02959.
[49] Shalf, J., Kamil, S., Oliker, L. and Skinner, D., 2005, November. Analyzing ultra-scale application communication requirements for a reconfigurable hybrid interconnect. In SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, pp. 17-17.
[50] Shaw, D.E., Deneroff, M.M., Dror, R.O., Kuskin, J.S., Larson, R.H., Salmon, J.K., Young, C., Batson, B., Bowers, K.J., Chao, J.C. and Eastwood, M.P., 2008. Anton, a special-purpose machine for molecular dynamics simulation. Communications of the ACM, 51(7), pp. 91-99.
[51] Singla, A., Singh, A., Ramachandran, K., Xu, L. and Zhang, Y., 2010. Proteus: A topology malleable data center network. In Proceedings of ACM HotNets, Oct. 2010.
[52] Taylor, P., 2022. Data center average annual power usage effectiveness (PUE) worldwide 2007-2022, Nov. 22.
[53] Teh, M.Y., Zhao, S., Cao, P. and Bergman, K., 2020. Couder: Robust topology engineering for optical circuit switched data center networks. arXiv preprint arXiv:2010.00090, Sept. 2020.
[54] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.T., Jin, A., Bos, T., Baker, L., Du, Y. and Li, Y., 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
[55] Urata, R., Liu, H., Yasumura, K., Mao, E., Berger, J., Zhou, X., Lam, C., Bannon, R., Hutchinson, D., Nelson, D. and Poutievski, L., 2022. Mission Apollo: Landing optical circuit switching at datacenter scale. arXiv preprint arXiv:2208.10041.
[56] US Energy Information Administration, 2022. Oklahoma State Profile and Energy Estimates, https://www.eia.gov/state/?sid=OK#:~:text=In%202021%2C%20wind%20supplied%2041,electricity%20net%20generation%20from%20wind.
[57] Vahdat, A., Liu, H., Zhao, X. and Johnson, C., 2011. The emerging optical data center. In Proceedings of the Optical Fiber Communications Conference.
[58] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30 (NeurIPS 2017).
[59] Wang, S., Wei, J., Sabne, A., Davis, A., Ilbeyi, B., Hechtman, B., Chen, D., Murthy, K.S., Maggioni, M., Zhang, Q., Kumar, S., Guo, T., Xu, Y. and Zhou, Z., 2023. Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[60] Wang, W., Khazraee, M., Zhong, Z., Ghobadi, M., Jia, Z., Mudigere, D., Zhang, Y. and Kewitsch, A., 2022. TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs. arXiv preprint arXiv:2202.00433.
[61] Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), pp. 65-76.
[62] Wu, Q., Deng, Q., Ganesh, L., Hsu, C.H., Jin, Y., Kumar, S., Li, B., Meza, J. and Song, Y.J., 2016. Dynamo: Facebook's data center-wide power management system. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, pp. 469-480.
[63] Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M. and Pang, R., 2021. GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663.
[64] Zhao, M., Agarwal, N., Basant, A., Gedik, B., Pan, S., Ozdal, M., Komuravelli, R., Pan, J., Bao, T., Lu, H. and Narayanan, S., 2022, June. Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA), pp. 1042-1057.