Adding MFMA Support to gem5

Marco Kurzynski                              Matthew D. Sinclair
ECE Department                               Computer Sciences Department
University of Central Florida                University of Wisconsin-Madison
Orlando, FL, USA                             Madison, WI, USA
[email protected]                            [email protected]
I. INTRODUCTION

In recent years machine learning (ML) has emerged as an important application domain driving the requirements for future systems. As DNNs get more sophisticated, their compute requirements and the datasets they are trained on continue to grow rapidly. For example, Gholami et al. showed that compute in the widely used Transformer networks [1] grew 750× over 2 years [2], while other work projects DNN compute and memory requirements to grow by 3× per year [3]–[5]. Given ML's growing requirements and its importance in driving system design, heterogeneous systems often add ML-specific features (e.g., Matrix Core Engines or TensorCores) to improve efficiency for ML workloads. However, given ML's voracious rate of growth and size, there is a growing challenge in performing early system exploration to identify promising optimizations for future systems.

For example, the gem5 simulator is a popular cycle-level system simulator for studying computer hardware and systems [6], [7]. It supports various CPUs, GPUs, and other important accelerators [8]–[12]. Unfortunately, the gem5 simulator currently does not support modern hardware features. For example, modern AMD and NVIDIA GPUs support specialized hardware units called Matrix Core Engines (MCEs, AMD [13]) or TensorCores (NVIDIA [14], [15]). These specialized hardware matrix units efficiently perform various mathematical operations that are heavily utilized in state-of-the-art ML workloads. For example, on AMD MI200 [16] and MI300 [17] GPUs, MCEs perform Matrix Fused Multiply Add (MFMA) instructions for a variety of precisions. However, since gem5 does not currently support MCEs, it is very difficult for users to simulate state-of-the-art workloads in gem5. To address this shortcoming, in this work we have enhanced gem5's GPU model to add MCEs for the MI200- and MI300-class GPUs gem5 supports. By adding this support, our changes enable running state-of-the-art ML workloads in gem5, as well as examining how MCE optimizations impact the behavior of future systems.

II. BACKGROUND

The gem5 simulator is a widely used, open-source, cycle-level computer system simulator. At its core gem5 contains an event-driven simulation engine. On top of this simulation engine gem5 implements a large number of models for system components: CPUs (out-of-order designs, in-order designs, and others), GPUs (AMD and ARM models), accelerators [10], [18], various memories, on-chip interconnects, coherent caches, I/O devices, and many others. Moreover, gem5 provides two modes: Syscall Emulation (SE) and Full System (FS). SE mode simulates an application's user mode code in detail but emulates the OS instead of simulating it in detail. Conversely, FS mode simulates both the OS and user mode code in detail, allowing users to study the interaction between the OS and architecture.

We previously enhanced and updated gem5's GPU support [8] to add support for multi-chiplet systems [12] and ML workloads [9], [19] in SE mode. As a result, ML workloads such as DNNMark [20] and DeepBench [21], which call ML libraries directly (e.g., using MIOpen [22] or rocBLAS [23] API calls), run in gem5's SE mode. While this represents a significant step forward, this approach is insufficient for modern high-level frameworks such as PyTorch [24], [25] and TensorFlow [26]. Accordingly, we recently released support for running ML models in gem5's FS mode [27], [28]. These high-level frameworks utilize highly tuned libraries, which frequently utilize MFMAs, to run the high-level code at high efficiency on a given target backend (e.g., an AMD GPU). Thus, users can now run PyTorch and TensorFlow workloads on CPU-GPU systems in gem5 using modern versions (e.g., v6) of AMD's open-source ROCm stack. However, since ML workloads that use PyTorch and TensorFlow frequently utilize MFMA instructions, their lack of effective, validated support in gem5 hinders users' ability to run PyTorch and TensorFlow workloads in gem5.

III. ADDING MFMA INSTRUCTIONS TO GEM5

MFMA instructions use dedicated MCEs in the SIMD units of each compute unit (CU). In our gem5 implementation we assume 1 MCE per SIMD unit in a CU for the simulated MI200 and MI300 GPUs. Since there are 4 SIMD units per CU by default, this corresponds to 4 MCEs per CU. We based this design on AMD's reported MCE operations per clock [17], as well as the number of SIMD units per CU and CUs in MI200 and MI300 GPUs. In hardware, the MCEs are separate functional units (FUs) from the existing FUs for transcendentals, vector ALU operations, vector loads/stores, scalar memory operations, and other operations [29]. Thus, a GPU CU can execute instructions that utilize these different FUs concurrently.
gem5 Code: In gem5 our MCE/MFMA implementation is primarily located in src/gpu-compute/compute_unit.cc (timing implementation) and src/gpu-compute/scoreboard_check_stage.cc (logic for issuing MFMA instructions and determining when MCEs are available). The functional implementation of the MFMAs is located in src/arch/amdgpu/vega/insts/instructions.hh.
All matrix core instructions are composed of the operation D = C + A * B, where D and C are 4 × 4 matrices, and A and B are 4 × 1 and 1 × 4 matrices, respectively. In AMD's Vega ISA, these instructions are of the form V_MFMA_[output type]_[M]X[N]X[K]_[B]B_[input type], where M, N, and K are dimensions and B is the number of these matrices that are multiplied. Different MFMA instructions implement different shapes and blocks of these operations which can be executed by a single matrix core. The larger the block size, the smaller the dimensions of the product, and the larger the block size and dimensions, the more cycles the instruction takes to execute in the MCE (Table 27 of the MI300 ISA manual [30]). In gem5, we use the NRDY_MATRIX_CORE field per SIMD unit to determine when an MCE is available: the scoreboard checks that the appropriate number of cycles for the given GPU model and instruction type have passed before allowing another MFMA to execute.
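Conceptually, this issue check behaves like a per-SIMD busy counter driven by a per-instruction cycle table. The following C++ sketch illustrates only that idea; it is not the actual gem5 source (which lives in the files listed above), the names are ours, and the cycle values shown are the Expected MI200 latencies from Table II:

    // Illustrative sketch of the MCE-availability check, not gem5's real code.
    #include <cstdint>
    #include <map>
    #include <string>

    // Cycles an MCE stays busy per MFMA, keyed by mnemonic (MI200 values from
    // the Expected column of Table II).
    static const std::map<std::string, uint32_t> mfmaCycles = {
        {"v_mfma_f32_4x4x1f32",    8},
        {"v_mfma_f32_16x16x4f32", 32},
        {"v_mfma_f64_16x16x4f64", 32},
    };

    struct SimdUnit {
        uint64_t mceBusyUntil = 0;  // cycle at which this SIMD unit's MCE frees up
    };

    // Mirrors the NRDY_MATRIX_CORE idea: a new MFMA may only issue on a SIMD
    // unit once the previous MFMA's full latency has elapsed.
    bool canIssueMfma(const SimdUnit &simd, uint64_t curCycle) {
        return curCycle >= simd.mceBusyUntil;
    }

    void issueMfma(SimdUnit &simd, const std::string &mnemonic, uint64_t curCycle) {
        simd.mceBusyUntil = curCycle + mfmaCycles.at(mnemonic);
    }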
While a CU is utilizing an MCE functional unit, it can perform other work within the same thread (or in other threads from different wavefronts). For work within a thread, the work must not have any dependencies on the data produced by the MCE. Within a thread, modern AMD GPUs implement this by requiring that either the programmer (if using inlined assembly) or the compiler (all other scenarios) identify independent work to perform. Frequently, this takes the form of software pipelining across multiple loop iterations that utilize MFMA instructions. However, if no independent work is available, then the compiler (or programmer) must insert NOPs (s_nop) to ensure no dependent work is performed until the MFMA instruction has completed (e.g., Table 36 of the MI300 ISA manual). Work from another wavefront (WF) in the same work group (WG) or WFs from a different WG do not have these requirements, but if they require an MCE for their instructions and all MCEs on the CU are currently busy, they must wait, and the GPU will not schedule that WF until the hardware resources are available (i.e., the scoreboard will prevent MFMAs using the same SIMD unit from being scheduled concurrently). Although we suspect real hardware MCE implementations have multi-stage pipelines (similar to NVIDIA [31] and RISC-V [32] GPUs), the AMD compiler appears to behave as if MFMA instructions from a given WF cannot be pipelined in MCEs when adding independent instructions or NOPs after MFMA instructions in a given thread. Thus, we model this behavior in gem5 and focus on parallelism across WFs and MCEs within a CU. In the future, if this support changes, the gem5 MCE code can be easily changed to support pipelining MCEs.
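To make the dependence requirement concrete, the fragment below sketches programmer-inserted spacing in the style of Listing 1 (Section IV-C). It is an illustrative sketch only: the operands a, b, and d mirror Listing 1, and the two s_nop instructions are placeholders, since the spacing actually required between an MFMA and a dependent consumer is instruction-specific (e.g., Table 36 of the MI300 ISA manual [30]).

    // Sketch only: two dependent MFMAs (both accumulate into d) separated by
    // placeholder s_nop instructions so the second one does not read the
    // accumulator before the first MFMA has produced it.
    asm volatile(
        "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
        "s_nop 7\n\t"   // placeholder wait cycles; the real count is instruction-specific
        "s_nop 7\n\t"
        "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
        : [D] "=v"(d)
        : [A] "v"(a), [B] "v"(b), [C] "v"(d));

In compiler-generated code this slack is normally filled with independent instructions (e.g., via software pipelining) rather than NOPs.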
A. Differences Between MI200 & MI300 Support

Although both MI200- and MI300-class GPUs support MCEs and have MFMA instructions, AMD modified their support for matrix operations between these two GPU generations. For example, AMD added new MFMA instructions in MI300, including a 2-block variant of v_mfma_f32_32x32x4_bf16 which takes the same number of cycles as the 1-block variant from MI200.¹ AMD also removed others; of the instructions benchmarked, v_mfma_i32_16x16x16i8 was one such instruction removed in the new architecture, which we used in our MI200 benchmarks (Section IV). Moreover, MI300 GPUs have improved the latency (i.e., reduced the number of cycles) required to execute some MFMA instructions compared to MI200 GPUs. This can be seen by comparing the Expected column in Tables II (MI200) and IV (MI300) for the MFMA instructions that are supported in both GPUs (e.g., fp32_16x16x16fp16). In the gem5 codebase, a comparison of supported and removed instructions, along with their cycle counts, can be seen in the mfma_cycles lookup table located in src/gpu-compute/compute_unit.cc.

¹In MCEs, block is similar to batch size – it represents how many blocks within the MFMA instruction the MCE can concurrently process. Thus, an i-block variant means the MCE can process i MFMA blocks from the instruction concurrently.

IV. METHODOLOGY

A. System Setup

1) gem5: Similar to prior work, we evaluate the MFMA instructions using a tightly coupled CPU-GPU architecture with a unified address space, shared memory, and coherent caches [33]. All CPU cores and GPU CUs are connected via a shared, inclusive L3, which also serves as the directory. We use ROCm 6.2.2 [34] and gem5 v24.1 [6], [7]. We configured this gem5 setup to mimic the real MI200 and MI300 GPUs (Section IV-A2) as closely as possible, using GPUFS mode. Table I summarizes the common key system parameters, which our group previously validated to tune gem5 relative to real hardware [35]–[37].

TABLE I
SIMULATED BASELINE GPU PARAMETERS.

GPU Feature                           Configuration
GPU Clock                             1801 MHz
Total CUs                             60
Num SIMD units / CU                   4
Max WF / SIMD unit                    10
Vector/Scalar Reg. File Size / CU     256 / 12.5 KB
L1 Instruction Cache / 4 CU           16 KB, 64B line, 8-way
L1 Scalar Cache / 4 CU                16 KB, 64B line, 8-way
L1 Data Cache / CU                    16 KB, 64B line, 16-way
L1 Instruction Cache Latency          40 cycles
L1 Data Cache Latency                 140 cycles
L1 Scalar Latency                     41 cycles
LDS Size / CU                         64 KB
LDS Latency                           65 cycles
L2 Cache                              8 MB, 64B line, 32-way (16 banks)
L2 Latency                            269 cycles
L2 Write Policy                       Write-back with write allocate
Main Memory                           16 GB HBM2, 4H stacks, 1000 MHz, 64 banks
Main Memory Latency                   483 cycles

2) Real GPUs: We compared the timing behavior of our new MFMA support in gem5 on the applications in Section IV-B using AMD Instinct™ MI210 [16], [38] and MI300™ [17], [29] GPUs on the AMD AI & HPC Fund [39]. For each benchmark we ensured consistent results by running each test at least five times and averaging the results.
B. Benchmarks

Although our group recently added support for running large-scale ML workloads in PyTorch and TensorFlow in gem5 [27], [28], in this work we focus on evaluating our MFMA support using targeted microbenchmarks because they help us test and validate MFMA behavior in isolation, unlike large-scale workloads where MFMA operations are intermixed with numerous other instructions. Nevertheless, our MFMA support also works for these larger workloads. To evaluate our added support for MI200 and MI300 GPUs, we designed a series of microbenchmarks based on tests from AMD's lab notes [40]. Tables II and III summarize the MI200-class GPU microbenchmarks, while Tables IV and V summarize the MI300-class GPU microbenchmarks. We selected these MFMA instructions because we found popular PyTorch and TensorFlow ML workloads from GPT [41]–[43] and MLPerf [44], [45] use them and because they demonstrate our gem5 support works for MFMA instructions of various types and precisions.
C. Validation

To validate that our gem5 MCE implementation provides high fidelity for MFMA instructions relative to real MI200 and MI300 GPUs, we initially attempted to time the MFMA instructions in the microbenchmarks from Section IV-B. However, we found that AMD's HIP compiler would perform heavy software pipelining of these instructions, making timing and validating the MFMA instructions in isolation challenging. Thus, we designed a series of handwritten, inlined assembly microbenchmarks that interleave a given MFMA instruction with GPU timing instructions (s_memtime, which returns a per-CU 64-bit counter value representing the clock cycle at which the instruction occurred). As discussed in Section III, the GPU's internal scoreboard prevents MFMAs on the same SIMD unit from being executed concurrently. Thus, by interleaving MFMA instructions and timing instructions, we can accurately time how long each individual MFMA takes in both real hardware and gem5.

Listing 1 shows an example of our microbenchmarks. In modern AMD GPUs the s_memtime instruction accesses the current clock cycle in the scalar cache. Since the scalar pipeline is independent of the MCE pipeline (Section III), the s_memtime instruction is not guaranteed to wait for a preceding MFMA instruction to complete – e.g., if there is only a single MFMA instruction between s_memtime instructions. To overcome this, in each test we insert multiple MFMA instructions back-to-back, with data dependencies between them. This ensures that the MFMA instructions must complete sequentially, which in turn forces the second s_memtime instruction to wait (the GPU WF scheduler will stop scheduling subsequent instructions in a WF if there are true data dependencies [8]). In turn this makes it easier to time the MFMA instructions on real GPUs. Note that since we intentionally ignore the GPU's requirements for putting sufficient independent work between MFMA instructions (Section III) in these microbenchmarks, their functional output may be incorrect. AMD's lab notes [40], which these microbenchmarks are based on, retain the compiler-inserted NOPs between MFMA instructions and validate that each MFMA instruction outputs the correct functional answers.

 1  asm volatile(
 2      "s_waitcnt lgkmcnt(0) & vmcnt(0)\n\t"
 3      "s_memtime %[start]\n\t"
 4      "s_waitcnt lgkmcnt(0)\n\t"
 5      "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
 6      "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
 7      "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
 8      "v_mfma_f32_4x4x1f32 %[D], %[A], %[B], %[C]\n\t"
 9      "s_memtime %[end]\n\t"
10      "s_waitcnt lgkmcnt(0)\n\t"
11      : [start] "=r"(start), [end] "=r"(end), [D] "=v"(d)
12      : [A] "v"(a), [B] "v"(b), [C] "v"(d));
13  total += end - start;

Listing 1. Example AMD GPU code snippet used to validate MFMA behavior in gem5.

In Listing 1, 4 MFMAs are inlined one after another, but only 2 are needed – since the s_memtime does not wait for the final MFMA to complete given there is not a data dependence. To calculate the time each MFMA takes, we use Equation 1:

    T_MFMA = (T_total - T_memtime - T_inst) / (N_MFMA - 1)        (1)

where T_total is total from line 13 of Listing 1, T_memtime = 40 and T_inst = 4 were determined from microbenchmarks in prior work [35]–[37], and N_MFMA = 4 in our example kernel above. T_memtime + T_inst is the time for the final MFMA to execute, so they are subtracted from T_total. Since this includes the final MFMA, we also subtract 1 from N_MFMA when dividing the total time to find the individual cycle time (which the s_memtime does not accurately measure, as discussed above).
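As a purely illustrative application of Equation 1 (the 68-cycle total below is an assumed value, not a measurement): if the fp32_4x4x1fp32 test in Listing 1 (N_MFMA = 4) observed T_total = 68 cycles, then

    T_MFMA = (68 - 40 - 4) / (4 - 1) = 24 / 3 = 8 cycles,

which matches the 8-cycle Expected latency for that instruction in Table II.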
V. RESULTS: MFMA IMPLEMENTATION VALIDATION

A. Modeling Matrix Core

For each MFMA instruction, we test two to five MFMA instructions back-to-back (Section IV-C) to ensure our support provides the expected behavior for various use cases on both MI200 and MI300 GPUs. Then, we compared the timed outputs from gem5 for each GPU model to both real hardware and the corresponding tables in the ISA manuals (e.g., Table 27 in the MI300 ISA manual [30]). Overall, these results show that we provide accurate timing models for various MI200 and MI300 MFMA instructions, and we have incorporated these timing models for each tested MFMA instruction into gem5's mainline public support.
1) MI200: Tables II and III compare the MFMA latency on real MI200 GPUs and gem5's simulated MI200 model, respectively. Overall, our gem5 MFMA implementations show very high fidelity relative to real MI200 GPUs: 1.5% mean absolute percentage error (MAPE) relative to real MI200 GPUs across 2-5 MFMAs. Although all numbers of back-to-back MFMAs have low MAPE values, in general we find that as the number of MFMAs increases, the MAPE relative to the real MI200 GPUs decreases, from 2.3% MAPE for 2 MFMAs to 0.4% for 5 MFMAs. We believe this happens because additional MFMAs lead to longer sections between timing measurements – reducing the likelihood that transient effects impact the timing results. However, for some MFMA instructions (shown in blue) additional care is required to obtain these results. Specifically, for these MFMA instructions we found that padding was required to get accurate measurements. To do this, we added multiple s_nop instructions before line 2 of Listing 1. Padding prevents the GPU from needing to fetch a new instruction cache line in the middle of our sequence of instructions that are timing the MFMA latency. We discuss the implications of this padding further in Section VI.

TABLE II
REAL HARDWARE MI200 LATENCY. INSTRUCTIONS IN BLUE REQUIRED PADDING TO BE ACCURATELY MEASURED.

MFMA                  N_MFMA=2  N_MFMA=3  N_MFMA=4  N_MFMA=5  Expected
fp64_16x16x4fp64        32        32        32        32        32
fp32_4x4x1fp32           8         8         8         8         8
fp32_16x16x4fp32        32        32        32        32        32
fp32_16x16x16fp16       32        32        32        32        32
i32_16x16x16i8          32        32        32        32        32
fp64_4x4x4fp64          16        16        16        16        16
fp32_4x4x4fp16           8         8         8         8         8

TABLE III
MI200 LATENCY IN GEM5. INSTRUCTIONS IN BLUE REQUIRED PADDING TO BE ACCURATELY MEASURED.

MFMA                  N_MFMA=2  N_MFMA=3  N_MFMA=4  N_MFMA=5  Expected
fp64_16x16x4fp64        32       31.75     31.83     32.06      32
fp32_4x4x1fp32           8.25     7.88      7.92      7.94       8
fp32_16x16x4fp32        32.25    32.5      32        32.125     32
fp32_16x16x16fp16       32       31.5      32.33     32         32
i32_16x16x16i8          32       32        32.33     32         32
fp64_4x4x4fp64          16       16        16.33     16.25      16
fp32_4x4x4fp16           7        7.5       7.67      8          8

2) MI300: Tables IV and V similarly compare the MFMA latency on real MI300 GPUs and gem5's simulated MI300 model. Similar to the MI200 results, our gem5 MFMA implementations again provide high fidelity relative to real MI300 GPUs: 1.3% MAPE. Once again, microbenchmarks with more MFMA instructions back-to-back tend to exhibit lower MAPE than those with fewer MFMA instructions: 1.2% MAPE for 5 MFMAs versus 2.1% MAPE for 2 MFMAs. However, in some situations we find that obtaining accurate latency measurements is difficult and can vary, even after padding the appropriate MFMA instructions to avoid unintended instruction cache misses during the timing section. Here, we find that the variability occurs when gem5 runs the CPU portion in KVM mode, which is non-deterministic. Nevertheless, these differences are very small, and in general these results show that our gem5 MFMA implementations are well correlated with real MI300 GPUs.

TABLE IV
REAL HARDWARE MI300 LATENCY. INSTRUCTIONS IN BLUE REQUIRED PADDING TO BE ACCURATELY MEASURED.

MFMA                  N_MFMA=2  N_MFMA=3  N_MFMA=4  N_MFMA=5  Expected
fp64_16x16x4fp64        32        32        32        32        32
fp32_4x4x1fp32           8         8         8         8         8
fp32_16x16x4fp32        32        32        32        32        32
fp32_16x16x16fp16       16        16        16        16        16
fp64_4x4x4fp64          16        16        16        16        16
fp32_4x4x4fp16           8         8         8         8         8
TABLE V
MI300 LATENCY IN GEM5. INSTRUCTIONS IN BLUE REQUIRED PADDING TO BE ACCURATELY MEASURED.

MFMA                  N_MFMA=2  N_MFMA=3  N_MFMA=4  N_MFMA=5  Expected
fp64_16x16x4fp64        32       32.13     32.08     31.94      32
fp32_4x4x1fp32           7.5      8         7.92      8.06       8
fp32_16x16x4fp32        32       31.75     31.83     32         32
fp32_16x16x16fp16       15       16.5      16.33     16.25      16
fp64_4x4x4fp64          16       16        16        16.25      16
fp32_4x4x4fp16           8        8         7.67      8.25       8

B. What-if Analysis

To further demonstrate the benefits of our MCE and MFMA additions to gem5, we added a new configuration parameter: --mfma-scale. By multiplying the default latency for a given MFMA instruction (Section III) by this configuration parameter, users can increase or decrease the number of cycles an MFMA takes in an MCE. Thus, this parameter allows users to conduct what-if analysis to examine how potential improvements to the MCEs affect an application's performance. Specifically, here we use our microbenchmarks with 2 MFMA instructions back-to-back (Listing 1).
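Conceptually, --mfma-scale is just a multiplier applied to the per-instruction MCE cycle counts described in Section III. The sketch below shows only that arithmetic; the names are our own illustration, not gem5's actual code:

    #include <cmath>
    #include <cstdint>

    // baseCycles: default MCE occupancy for an MFMA (e.g., 16 cycles for
    // fp32_16x16x16fp16 on MI300, per Table IV). mfmaScale: the value passed
    // via --mfma-scale (1 keeps the default, 2 doubles the latency, and a
    // value below 1 would model a faster MCE).
    uint64_t scaledMfmaLatency(uint64_t baseCycles, double mfmaScale) {
        return static_cast<uint64_t>(std::llround(baseCycles * mfmaScale));
    }

For example, with --mfma-scale=2 the 2-MFMA microbenchmarks should take roughly twice as long, which matches the doubled latencies reported in Table VI below (e.g., fp64_16x16x4fp64 grows from 32 to 63 cycles).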
Table VI shows the effect of doubling MFMA latency (by setting --mfma-scale=2) in gem5's simulated MI300 model. Intuitively, since our microbenchmarks put 2 MFMA instructions back-to-back, doubling MFMA latency should cause each microbenchmark to take twice as long compared to the default of --mfma-scale=1. In general, the results in Table VI show that this is the case. However, not all measurements show perfect, linear scaling: overall we obtain a 3% MAPE relative to doubling the latencies. This happens for similar reasons to Section V-A: non-determinism from KVM. Nevertheless, these results further demonstrate the validity of our gem5 MFMA implementation and show how changes to --mfma-scale affect T_MFMA.

TABLE VI
MI300 SCALED LATENCY IN GEM5 WITH N_MFMA = 2. INSTRUCTIONS IN BLUE REQUIRED PADDING TO BE ACCURATELY MEASURED.

MFMA                  --mfma-scale=1  --mfma-scale=2
fp64_16x16x4fp64            32              63
fp32_4x4x1fp32               7.5            16
fp32_16x16x4fp32            32              65
fp32_16x16x16fp16           15              33
fp64_4x4x4fp64              16              32
fp32_4x4x4fp16               8              16
However, it is important to note that we have complete control over the instructions being executed and measured in our microbenchmarks, due to our use of inlined assembly. This allowed us to observe near perfect, linear scaling in our MFMA what-if analysis. In real workloads the compiler is responsible for enforcing timing conditions, making it more difficult to remove NOP instructions as MFMA latency scales. Thus, it is important to also evaluate the impact of this parameter on a real workload. We discuss the implications of this further in Section VI.
VI. IMPLEMENTATION LIMITATIONS

Removing NOPs: As discussed in Sections III and V, AMD's compiler-based approach to enforcing correctness with MFMA instructions introduces challenges when performing what-if analysis in simulators such as gem5. Consequently, scaling the latency of MFMA instructions in gem5 without corresponding changes to the compiler to change how much independent work must be inserted between MFMA instructions in the same WF will likely not result in linear scaling like we observed in Section V-B. This is a limitation of our MCE support in gem5. To overcome this, users have two options: they can either a) write inlined GPU assembly snippets that reduce the amount of independent work between MFMAs to mirror the changes in --mfma-scale, and then use these snippets in larger workloads, or b) rewrite the compiler to adjust how much independent work must be placed between MFMA instructions in the same WF based on the desired MFMA scaling factor. Crucially, AMD's ROCm [34] GPU stack is open source, enabling sufficiently motivated researchers to make the necessary changes to the compiler. Obviously, both of these options also have downsides (e.g., requiring compiler changes or inlined assembly), but regardless it is important that users understand the implications of how AMD enforces correctness for MFMA instructions.
MFMA Instruction Failures: AMD recently introduced a new access mode: s_set_gpr_idx, which utilizes a separate general-purpose register (GPR) addressing mode to improve the performance of specific code sequences. However, currently this addressing mode is unsupported in gem5. As a result, some MFMA instructions (e.g., v_mfma_fp32_32x32x8fp16 and v_mfma_fp32_32x32x1fp32) are unsupported. Although it is possible to implement this addressing mode, it requires updating every AMD GPU instruction in gem5 to check if it is using this addressing mode. Since there are thousands of AMD GPU instructions, and relatively few MFMA instructions use this addressing mode, we instead focused on adding support for the MFMA instructions that do not use this addressing mode.

Cache-Line Aligned MFMA Instruction Sequences: As discussed in Section V, the timing of some MFMA tests is sensitive to fetching new cache lines in the middle of our timed sequence of instructions. Without cache line aligning these tests, the gem5 and real GPU behavior often exhibit unexpected behavior. However, since we observed that both gem5 and the real GPU exhibit similar unexpected behavior when an additional cache line is fetched in the middle of the region of interest, we do not believe changes are required to gem5. Instead, if users observe unexpected results with specific numbers of MFMA instructions and this significantly impacts the overall behavior of the workload, it is recommended they investigate the cache line alignment of the region of interest.

VII. CONCLUSION

As ML workloads increasingly drive the requirements of future systems, it is imperative that state-of-the-art tools like gem5 evolve with them. Without significant improvements, researchers will be unable to perform early stage co-design for these important workloads while being confident the results are representative. In particular, a major shortcoming of gem5's support for ML workloads is its lack of support for MCEs. In this work, we rectify this issue and add support for MCEs and their MFMA instructions in modern MI200 and MI300 AMD GPUs in gem5. Our results show that our support provides high fidelity MFMA support for both MI200 and MI300 GPUs. For a wide variety of MFMA instructions, our gem5 support provides identical or near identical latencies: 1.5% and 1.3% MAPE relative to real MI200 and MI300 GPUs, respectively. Moreover, our results also show how researchers can leverage this support to perform what-if analysis by scaling the performance of MFMA operations in gem5. Overall, this work is a significant step forward and enables researchers to perform rapid, high fidelity experimentation for a variety of modern, important GPU workloads.

ACKNOWLEDGMENTS

This work is supported in part by the National Science Foundation CSSI grant Frameworks-2311889. We also thank Matt Poremba at AMD Research for his advice, which greatly improved the quality of this work.

REFERENCES

[1] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," CoRR, vol. abs/1909.08053, 2019.
[2] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, "AI and Memory Wall," IEEE Micro, vol. 44, no. 03, pp. 33–39, May 2024. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/MM.2024.3373763
[3] S. Naffziger, N. Beck, T. Burd, K. Lepak, G. H. Loh, M. Subramony, K. Hamidouche, S. Hossamani, W. Huang, M. Islam, N. Jayasena,
and S. White, “Pioneering Chiplet Technology and Design for the J. Kalamatianos, O. Kayiran, J. Kotra, A. Lee, D. Lowell, N. Madan,
AMD EPYC™ and Ryzen™ Processor Families : Industrial Product,” in A. Majumdar, N. Malaya, S. Manne, S. Mashimo, D. McDougall,
ACM/IEEE 48th Annual International Symposium on Computer Archi- E. Mednick, M. Mishkin, M. Nutter, I. Paul, M. Poremba, B. Potter,
tecture, ser. ISCA. New York, NY, USA: Association for Computing K. Punniyamurthy, S. Puthoor, S. E. Raasch, K. Rao, G. Rodgers,
Machinery, 2021, pp. 57–70. M. Scrbak, M. Seyedzadeh, J. Slice, V. Sridharan, R. van Oostrum,
[4] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, E. van Tassell, A. Vishnu, S. Wasmundt, M. Wilkening, N. Wolfe,
N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, M. Wyse, A. Yalavarti, and D. Yudanov, “A Research Retrospective
Z. Zhou, and D. A. Patterson, “TPU v4: An Optically Reconfigurable on AMD’s Exascale Computing Journey,” in Proceedings of the 50th
Supercomputer for Machine Learning with Hardware Support for Annual International Symposium on Computer Architecture, ser. ISCA.
Embeddings,” in Proceedings of the 50th Annual International New York, NY, USA: Association for Computing Machinery, 2023.
Symposium on Computer Architecture, ser. ISCA. New York, NY, [Online]. Available: https://doi.org/10.1145/3579371.3589349
USA: Association for Computing Machinery, 2023. [Online]. Available: [17] A. Smith, G. H. Loh, M. J. Schulte, M. Ignatowski, S. Naffziger,
https://doi.org/10.1145/3579371.3589350 M. Mantor, N. Kalyanasundharam, V. Alla, N. Malaya, J. L. Greathouse,
[5] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, E. Chapman, and R. Swaminathan, “Realizing the AMD Exascale
G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, Heterogeneous Processor Vision : Industry Product,” in 51st ACM/IEEE
S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten Lessons from Annual International Symposium on Computer Architecture, ser. ISCA.
Three Generations Shaped Google’s TPUv4i,” in Proceedings of the Piscataway, NJ, USA: IEEE, 2024, pp. 876–889. [Online]. Available:
48th Annual International Symposium on Computer Architecture, ser. https://doi.org/10.1109/ISCA59077.2024.00068
ISCA. Piscataway, NJ, USA: IEEE Press, 2021, p. 1–14. [Online]. [18] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, “Co-
Available: https://doi.org/10.1109/ISCA52012.2021.00010 designing Accelerators and SoC Interfaces using gem5-Aladdin,” in 49th
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, Annual IEEE/ACM International Symposium on Microarchitecture, ser.
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, MICRO, 2016, pp. 1–12.
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” [19] B. R. Bruce, A. Akram, H. Nguyen, K. Roarty, M. Samani, M. Fariborz,
ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, T. Reddy, M. D. Sinclair, and J. Lowe-Power, “Enabling Reproducible
2011. and Agile Full-System Simulation,” in IEEE International Symposium
[7] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, on Performance Analysis of Systems and Software, ser. ISPASS. Los
M. Andreozzi, A. Armejach, N. Asmussen, S. Bharadwaj, G. Black, Alamitos, CA, USA: IEEE Computer Society, 2021, pp. 183–193.
G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, [20] S. Dong and D. Kaeli, “DNNMark: A Deep Neural Network Benchmark
N. Derumigny, S. Diestelhorst, W. Elsasser, M. Fariborz, A. Farmahini- Suite for GPUs,” in Proceedings of the General Purpose GPUs, ser.
Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, GPGPU. New York, NY, USA: ACM, 2017, pp. 63–72. [Online].
B. Hanindhito, A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, Available: http://doi.acm.org/10.1145/3038228.3038239
M. Horsnell, S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. [21] S. Narang and G. Diamos, “An Update to DeepBench with a Focus
Jones, M. Jung, S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, on Deep Learning Inference,” https://svail.github.io/DeepBench-update/,
T. Marinelli, C. Menard, A. Mondelli, T. Mück, O. Naji, K. Nathella, 2017.
H. Nguyen, N. Nikoleris, L. E. Olson, M. Orr, B. Pham, P. Prieto, [22] J. Khan, P. Fultz, A. Tamazov, D. Lowell, C. Liu, M. Melesse, M. Nand-
T. Reddy, A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, himandalam, K. Nasyrov, I. Perminov, T. Shah, V. Filippov, J. Zhang,
M. D. Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, J. Zhou, B. Natarajan, and M. Daga, “MIOpen: An Open Source Library
I. Vougioukas, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon, and For Deep Learning Primitives,” CoRR, vol. abs/1910.00078, 2019.
Éder F. Zulian, “The gem5 simulator: Version 20.0+,” 2020. [23] AMD, “rocBLAS Library,” https://rocm-documentation.readthedocs.io/en/latest/ROCm
[8] A. Gutierrez, B. M. Beckmann, A. Dutu, J. Gross, M. LeBeane, 2024.
J. Kalamatianos, O. Kayiran, M. Poremba, B. Potter, S. Puthoor, M. D. [24] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky,
Sinclair, M. Wyse, J. Yin, X. Zhang, A. Jain, and T. Rogers, “Lost in B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia,
Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng,
Level,” in IEEE International Symposium on High Performance Com- J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar,
puter Architecture, ser. HPCA. Washington, DC, USA: IEEE Computer L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu,
Society, 2018, pp. 608–619. C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim,
[9] K. Roarty and M. D. Sinclair, “Modeling Modern GPU Applications in M. Y. Siraichi, H. Suk, S. Zhang, M. Suo, P. Tillet, X. Zhao,
gem5,” in 3rd gem5 Users’ Workshop, June 2020. E. Wang, K. Zhou, R. Zou, X. Wang, A. Mathews, W. Wen,
[10] S. Rogers, J. Slycord, M. Baharani, and H. Tabkhi, “gem5-SALAM: G. Chanan, P. Wu, and S. Chintala, “PyTorch 2: Faster Machine
A System Architecture for LLVM-based Accelerator Modeling,” in Learning Through Dynamic Python Bytecode Transformation and
53rd Annual IEEE/ACM International Symposium on Microarchitecture, Graph Compilation,” in Proceedings of the 29th ACM International
2020, pp. 471–482. Conference on Architectural Support for Programming Languages and
[11] Z. Spencer, S. Rogers, J. Slycord, and H. Tabkhi, “Expanding Hardware Operating Systems, Volume 2, ser. ASPLOS ’24. New York, NY, USA:
Accelerator System Design Space Exploration with gem5-SALAMv2,” Association for Computing Machinery, 2024, p. 929–947. [Online].
Journal of Systems Architecture, vol. 154, p. 103211, 2024. Available: https://doi.org/10.1145/3620665.3640366
[12] B. W. Yogatama, M. D. Sinclair, and M. M. Swift, “Enabling Multi-GPU [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
Support in gem5,” in 3rd gem5 Users’ Workshop, June 2020. A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
[13] G. Schieffer, D. A. De Medeiros, J. Faj, A. Marathe, and I. Peng, “On PyTorch,” in NeurIPS-W, 2017.
the Rise of AMD Matrix Cores: Performance, Power Efficiency, and [26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Programmability,” in IEEE International Symposium on Performance Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
Analysis of Systems and Software, ser. ISPASS, 2024, pp. 132–143. A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur,
[14] J. Choquette, O. Giroux, and D. Foley, “Volta: Performance and Pro- J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah,
grammability,” IEEE Micro, vol. 38, no. 2, pp. 42–52, 2018. M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
[15] M. A. Raihan, N. Goli, and T. M. Aamodt, “Modeling Deep Learning V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden,
Accelerator Enabled GPUs,” in IEEE International Symposium on M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-
Performance Analysis of Systems and Software, ser. ISPASS, IEEE. Scale Machine Learning on Heterogeneous Systems,” 2015, software
Piscataway, NJ, USA: IEEE Press, 2019, pp. 79–92. available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[16] G. H. Loh, M. J. Schulte, M. Ignatowski, V. Adhinarayanan, S. Aga, [27] V. Ramadas, M. Poremba, B. Beckmann, and M. D. Sinclair, “Simulation
D. Aguren, V. Agrawal, A. M. Aji, J. Alsop, P. Bauman, B. M. Support for Fast and Accurate Large-Scale GPGPU & Accelerator
Beckmann, M. V. Beigi, S. Blagodurov, T. Boraten, M. Boyer, W. C. Workloads,” in Third Workshop on Open-Source Computer Architecture
Brantley, N. Chalmers, S. Chen, K. Cheng, M. L. Chu, D. Cownie, Research, ser. OSCAR, June 2024.
N. Curtis, J. Del Pino, N. Duong, A. Duundefinedu, Y. Eckert, [28] V. Ramadas and M. D. Sinclair, “Simulating Machine Learning Models
C. Erb, C. Freitag, J. L. Greathouse, S. Gurumurthi, A. Gutierrez, at Scale,” in SRC TECHCON, September 2024.
[29] AMD, “AMD CDNA™ 3 Architecture,” 2023. [Online]. Available: V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk,
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo,
[30] Advanced Micro Devices (AMD), AMD Instinct MI300 Instruction H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo,
Set Architecture Reference Guide, June 2024. [Online]. Available: A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos,
M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov,
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf
[31] J. Choquette, “NVIDIA Hopper H100 GPU: Scaling Performance,” H. P. de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong,
IEEE Micro, vol. 43, no. 03, pp. 9–17, May 2023. [Online]. Available: T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae,
https://doi.ieeecomputersociety.org/10.1109/MM.2023.3256796 A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted,
[32] A. Nada, G. M. Sarda, and E. Lenormand, “Cooperative Warp Execution H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry,
in Tensor Core for RISC-V GPGPU,” in IEEE 31st International H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard,
Symposium on High Performance Computer Architecture, ser. HPCA, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler,
2025. M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song,
[33] P. Dalmia, R. Shashi Kumar, and M. D. Sinclair, “CPElide: Efficient N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak,
Multi-Chiplet GPU Implicit Synchronization,” in Proceedings of 57th M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle,
IEEE/ACM International Symposium on Microarchitecture, ser. MICRO. N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya,
Los Alamitos, CA, USA: IEEE Computer Society, 2024. C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward,
[34] AMD, “ROCm: Open Platform For Development, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng,
Discovery and Education around GPU Computing,” M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman,
https://gpuopen.com/compute-product/rocm/, 2021. S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan,
[35] C. Jamieson, A. Chandrashekar, I. McDougall, and M. D. Sinclair, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng,
“GAP: gem5 GPU Accuracy Profiler,” in 4th gem5 Users’ Workshop, J. Zhuang, W. Zhuk, and B. Zoph, “GPT-4 Technical Report,” 2024.
June 2022. [Online]. Available: https://arxiv.org/abs/2303.08774
[36] V. Ramadas, D. Kouchekinia, N. Osuji, and M. D. Sinclair, “Closing [44] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J.
the Gap: Improving the Accuracy of gem5’s GPU Models,” in 5th gem5 Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka,
Users’ Workshop, June 2023. C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S.
[37] V. Ramadas, D. Kouchekinia, and M. D. Sinclair, “Further Closing the Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar,
GAP: Improving the Accuracy of gem5’s GPU Models,” in 6th Young D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius,
Architects’ Workshop, ser. YArch, April 2024. C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao,
[38] AMD, “Introducing AMD CDNA™ 2 Architecture,” F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada,
B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, “MLPerf Inference
https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf,
2022. Benchmark,” in 2020 ACM/IEEE 47th Annual International Symposium
[39] ——, “AMD AI & HPC Fund: Accelerate Your on Computer Architecture, ser. ISCA, 2020, pp. 446–459.
Research with AMD,” 2024. [Online]. Available: [45] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius,
https://www.amd.com/en/corporate/hpc-fund.html D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks,
[40] ——, “AMD lab notes,” https://github.com/amd/amd-lab-notes/, 2024. D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang,
[41] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- G. Pekhimenko, L. Pentecost, V. Janapa Reddi, T. Robie, T. St John, C.-J.
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, Wu, L. Xu, C. Young, and M. Zaharia, “MLPerf Training Benchmark,” in
C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, Proceedings of Machine Learning and Systems, ser. MLSys, I. Dhillon,
J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Papailiopoulos, and V. Sze, Eds., vol. 2, 2020, pp. 336–349.
D. Amodei, “Language Models are Few-Shot Learners,” in Advances
in Neural Information Processing Systems, ser. NeurIPS, H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33. Red
Hook, NY, USA: Curran Associates, Inc., 2020, pp. 1877–1901.
[42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language Models are Unsupervised Multitask Learners,” OpenAI Blog,
vol. 1, no. 8, 2019.
[43] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya,
F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,
R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao,
M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro,
C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman,
G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell,
A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang,
F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen,
B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier,
Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar,
D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou,
D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford,
L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh,
R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene,
J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He,
M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele,
B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain,
J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun,
T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar,
T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner,
J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich,
A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan,
T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin,
M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini,
S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne,
B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil,
D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin,