Energy Cost Modelling For Optimizing Large Language Model Inference On Hardware Accelerators
Robin Geens† , Man Shi† , Arne Symons† , Chao Fang†,‡ , Marian Verhelst†
†MICAS, KU Leuven, Belgium
‡School of Electronic Science and Engineering, Nanjing University, China
[email protected]
Abstract—The rise of Large Language Models (LLMs) has significantly escalated the demand for efficient LLM inference, primarily fulfilled through cloud-based GPU computing. This approach, while effective, is associated with high energy consumption, resulting in large operating expenses and a considerable carbon footprint. In the meantime, growing privacy concerns advocate for inference on edge devices, which are constrained by limited battery capacity. Both cloud and edge scenarios necessitate energy-efficient LLM inference strategies.

This paper addresses the urgent need for energy-efficient inference by proposing an open-source framework designed to model LLM workloads on dedicated accelerators. Our framework facilitates early identification of energy bottlenecks through rapid modeling of the execution efficiency of a wide range of LLMs on diverse hardware architectures. Key innovations include a PyTorch-based generalized LLM template to easily generate custom workload graphs, extensions of the ZigZag design space exploration framework, and techniques to significantly speed up simulation time at a negligible loss of accuracy. Using a representative hardware architecture, we conduct three case studies that reveal critical energy bottlenecks in Llama2-7B inference: 1) memory-bound computing in the decode stage is detrimental not only for the latency but also for the energy cost; 2) aggressive weight-only quantization can reduce the energy cost by 4.6× and shift the bottleneck from weight fetching to the attention mechanism; 3) in edge scenarios, the relative energy cost of the prefill stage is more significant, encouraging efforts to optimize both the prefill and decode stages. Our framework is available open-source at github.com/KULeuven-MICAS/zigzag-llm.

I. INTRODUCTION

The rapid advancement of generative AI, particularly Large Language Models (LLMs), has significantly increased the demand for LLM inference. This growing demand is predominantly met through cloud-based GPU computing [1]. While this approach supports low-latency applications, it comes with high energy consumption. Furthermore, privacy concerns are prompting LLM inference on edge devices, which also face stringent energy constraints due to limited battery capacity [2]. These trends emphasize the need for energy-efficient LLM inference in the cloud and at the edge.

To improve efficiency, LLM workloads are increasingly executed on dedicated hardware accelerators [3]. However, the design of these accelerators is challenging, as the energy consumption bottlenecks vary strongly as a function of the workload characteristics [4], leading to different optimal accelerator architecture configurations. It is hence important to conduct rapid modeling and evaluation of LLM workloads on a variety of hardware accelerator topologies to locate the energy bottlenecks during the early design stages and to facilitate the discovery of optimal energy-efficient design points.

Several modeling tools have been developed to enable rapid early-stage analysis and to help understand the impact of hardware parameters on the performance of LLM inference [5]–[7]. Despite their utility for latency assessments, these tools lack quantitative energy estimations. As such, the challenge of identifying energy-efficient design points on dedicated accelerators remains unaddressed. To address these limitations, we propose an open-source and easy-to-use framework that aims to expose the energy bottlenecks of LLM inference on custom hardware architectures.

In summary, the main contributions of this paper are as follows:
1) We develop a user-friendly framework to model LLM inference energy and latency on user-defined hardware accelerator architectures.
2) We introduce techniques to save simulation time by reducing the workload graph and the required number of simulations, and adjusting the results afterwards.
3) We deploy the framework to uncover often-overlooked energy bottlenecks in the inference of popular LLMs, providing insights for efficiency improvements.

Fig. 1. Diagram of a single layer of a decoder-only transformer.

II. BACKGROUND

A. Large Language Models

Key Components. Figure 1 illustrates the fundamental computationally-intensive components of transformer-based
Authorized licensed use limited to: NUST School of Electrical Engineering and Computer Science (SEECS). Downloaded on April 28,2025 at 09:48:02 UTC from IEEE Xplore. Restrictions apply.
A. Workload Graph Generation

Our simulation engine accepts workloads in Open Neural Network Exchange (ONNX) [19] format. ONNX representations are available for many workloads, as it is a widely used standard. In order to have full control over the internals of the LLMs to be modeled, the framework houses an LLM implementation programmed in PyTorch [20] with an automatic conversion to ONNX. The implementation can easily be parameterized to correctly represent many popular LLM models. Moreover, this enables us to force the use of KV-caching directly in the workload graph, which would be impossible when using the available ONNX graphs. In this case, the input sequence length is forced to 1 and the key and value caches are treated as inputs of the network graph (of a chosen size, see Section III-C). A separate ONNX graph is generated for the prefill and decode stages, as shown in Figure 2. Weight and activation quantization is supported by manually attaching a custom attribute to each layer in the ONNX graph.

B. Elimination of Repeated Layers

Transformer architectures exhibit a high level of regularity. Though an LLM may have a large number of parameters, the corresponding transformer architecture repeats the exact same layer many times (but with different weights), and each layer consists of many identical heads. To greatly save simulation time, we parameterize the in-house PyTorch model with only a single layer and a single head before exporting it to ONNX. The result is a trivial ONNX graph that contains all operations, but only for a single layer and head. In accordance with Figure 1, the query, key, value and output projections are each combined over all heads and represented as a single matrix multiplication. Afterwards, the simulation results are multiplied by the correct number of layers and heads to obtain the inference results for the full model. This technique drastically speeds up simulation time while maintaining accuracy. Furthermore, it ensures that the simulation time is only marginally dependent on the model size. For reference, the hardware performance simulations of both the OPT-125M and GPT3-175B models take around 20 seconds on a medium-sized server.

C. Modelling the Decode Phase with KV-caching

The decode phase of LLM inference is auto-regressive, meaning that all tokens after the prefill stage are generated one by one. Simulating the full decode process can be achieved by running the simulation for each generated token and providing only a single token as input in each run, which is a cumbersome and time-intensive process.

Instead, we approximate the energy and latency results of the full decode phase by simulating the generation of a single token somewhere along the decode sequence, and multiplying the found energy and latency results by the number of tokens. This comes at the risk of over-generalizing the workload's temporal mapping: generating the first token (when the KV-cache is filled by the prefill phase only) might result in a different memory access pattern than the last token (when the KV-cache is full). Since the number of operations of the attention operation is linear in the sequence length, we select the token halfway through the decode sequence as the candidate for simulation. This way, the overestimation of energy and latency for the tokens earlier in the decode stage can even out the underestimation for the tokens that come after.

Furthermore, to model a full LLM inference for a realistic scenario, we assume that half of the model's sequence length consists of input tokens that are processed in the prefill stage, and the other half are output tokens generated in the decode stage. This common scenario occurs, for example, when the LLM is used to rewrite a piece of text: the prefill stage is used to load the input text, and the altered text is generated in the decode stage.

Thus, to simulate the decode phase, we construct the scenario where the KV-cache is filled by the previous (3L/4 − 1) tokens and the (3L/4)-th token is provided as input in order to generate the (3L/4 + 1)-th, where L is the model's context window size. This principle is demonstrated in Figure 3. The results of this single simulation are then multiplied by L/2 to represent the full decode process.

We validate this approach by simulating the full auto-regressive decode phase of Llama2-7B [21], token by token. The summed inference energy over all consecutive tokens is compared to the result of the proposed approach. Our approach exhibits a relative error of 1.0 × 10⁻⁵ and speeds up the decode simulation by a factor of 2048.

D. Simulation Engine

The simulation itself is performed by ZigZag [4], which we extend with support for LLM workloads and quantization. ZigZag is capable of finding the optimal spatial and temporal unrolling for the given workload and architecture in terms of the chosen criterion (energy, delay, or energy-delay product). One important remark is that while non-linear layers (e.g. SoftMax, GeLU and LayerNorm) are part of the ONNX graph, ZigZag skips them during simulation. However, since the computational complexity of LLM inference is dominated by matrix multiplications, this simplification is expected to have minimal impact on the results. Moreover, hardware accelerators often have built-in support for the non-linear layers, such that they can be computed in parallel and at a low energy cost.

E. Hardware Architecture Definition

In order to model LLM inference on hardware accelerators and derive practically usable conclusions and insights, the hardware model should be realistic and representative of real-life use cases. For energy-efficient inference, accurately modelling the memory hierarchy and the corresponding access energy costs is especially important.

As such, our work employs the ZigZag hardware architecture model. This model incorporates: 1) the compute core, in the form of a processing element (PE) array with a flexible shape and local registers of variable size; 2) a hierarchy of memories; 3) the interconnection pattern between the PEs and the memory hierarchy. At each level, the memory hierarchy has a variable capacity, bandwidth, number of read/write ports,
and targeted operands. An example of a resulting accelerator architecture is shown in Figure 4.

Fig. 4. Hardware diagram of the target platform to model LLM inference.

IV. CASE STUDIES

To demonstrate the practical applicability of the framework, we conduct three case studies and extract hardware-oriented insights from each of them. In all cases, we use Llama2-7B [21] as the workload. This openly available, medium-sized LLM is often used as a representative LLM workload in the literature. For the sake of reducing the design space, we fix the dataflow to the commonly used weight-stationary dataflow for both the prefill and decode stages. The batch size is set to 8 to mirror a practical setting, while the resulting energy and latency values are normalized to a single batch. Furthermore, the energy-delay product (EDP) is adopted as the optimization criterion in all experiments, enabling the identification of design points that offer a balanced trade-off between latency and energy.

To visualize the results, the modeled layers with similar compute behaviour are grouped and merged into a single category. We define three groups: linear projection, for all query, key, value and output projections; attention, for the Q × Kᵀ and S × V layers; and feed-forward network (FFN), for the two layers of the fully-connected networks after each set of heads and the gate layer in Llama models. The three groups are also shown in Figure 1.

A. Hardware Setup

The case studies use a generalized hardware architecture [23] that is representative of state-of-the-art accelerators for both cloud and edge scenarios. The cloud and edge accelerators employ the same architecture template, with specifications in terms of compute capacity (PE array size), memory capacity, and memory bandwidth, as shown in Figure 4. The specifications for the cloud accelerator align with those used in a previous study [24], while the edge specifications are matched to the resources of the Google Edge TPU [25].

TABLE II. ENERGY COSTS ASSUMED FOR THE GENERIC HARDWARE ARCHITECTURE

Table II lists the energy costs of the multiply-accumulate (MAC) operation along with the different memory read and write costs used in the case studies. As vendors have strict NDAs for different technologies, we approximate the energy costs based on a 45nm process [22]. These energy costs can easily be customized by the user as framework inputs. All case studies assume the same MAC energy cost, regardless of datatype.

B. Energy and Latency Analysis on Cloud Accelerator

As an initial demonstration, we analyze the energy and latency breakdown of a full inference (prefill and decode) of Llama2-7B [21] on the Cloud Accelerator. The precision of the weights, activations and intermediate output values is set to 32 bits (e.g. FP32). The results are shown in Figure 5.

From the latency breakdown, we observe that the prefill stage achieves nearly ideal computational latency with negligible memory stalls. Its efficiency stems from operating in a compute-bound regime, consistent with roofline model analysis [5]. This underscores the prefill stage's ability to leverage extensive parallelism, which is inherently facilitated by cloud accelerators.

During the decode stage, the accelerator operates in the memory-bound region and the latency is dominated by memory stalls, i.e. the time that the PE array cannot be fully utilized because data still needs to be transferred from memory. The low arithmetic intensity stems from the unusual compute pattern during the auto-regressive decode stage with KV-caching: for next-token prediction, only a single new token is provided as input, yet all model weights need to be fetched from memory.

The results clearly show that operating in the memory-bound region not only penalizes the inference latency, but also greatly increases the inference energy cost, since more expensive higher-level memory accesses are incurred. As such, the main energy and latency bottleneck in this case is the fetching of high-precision weights from higher-level memories.

Fig. 5. The energy cost and latency performance of Llama2-7B on the Cloud Accelerator of Figure 4.
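The compute-bound prefill versus memory-bound decode behaviour described above can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is illustrative only: the layer dimensions mimic a Llama2-7B-style 4096 × 4096 linear projection, and the ridge point of 100 ops/byte is a hypothetical accelerator balance point, not a value taken from the paper.

```python
# Back-of-the-envelope arithmetic intensity (MACs per byte moved) of one
# linear projection, contrasting prefill (matrix-matrix) with decode
# (matrix-vector, one token with KV-caching). Dimensions are
# Llama2-7B-style; the ridge point is a hypothetical balance point.

def arithmetic_intensity(seq_len: int, d_in: int = 4096, d_out: int = 4096,
                         bytes_per_elem: int = 4) -> float:
    """Ops per byte for a (seq_len x d_in) @ (d_in x d_out) projection."""
    macs = seq_len * d_in * d_out
    # Bytes moved: input activations, weights, and output activations.
    traffic = (seq_len * d_in + d_in * d_out + seq_len * d_out) * bytes_per_elem
    return macs / traffic

RIDGE_OPS_PER_BYTE = 100  # hypothetical compute/bandwidth balance point

prefill = arithmetic_intensity(seq_len=2048)  # whole prompt at once
decode = arithmetic_intensity(seq_len=1)      # one token per step

regime = lambda ai: "compute" if ai > RIDGE_OPS_PER_BYTE else "memory"
print(f"prefill: {prefill:.1f} ops/B -> {regime(prefill)}-bound")
print(f"decode:  {decode:.3f} ops/B -> {regime(decode)}-bound")
```

With these assumed dimensions, prefill lands at 256 ops/byte (well above the ridge) while decode drops below 1 ops/byte, since all weights are fetched to process a single token.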
Fig. 6. Impact of different quantization schemes on the energy and the latency cost of Llama2-7B inference on the Cloud Accelerator of Figure 4.
Another observation is that the attention module is rather inconsequential in comparison to the weight-based layers. This effect is even more pronounced in larger, commercially used models such as GPT3-175B [26]. In larger models, the problem size of the weight-based layers significantly increases, whereas the sequence length, which determines the complexity of the attention layers, stays in the same order of magnitude. This observation differs from other works in the literature that suggest the attention module dominates the inference cost due to its quadratic complexity as a function of the sequence length, prompting special hardware-enabled techniques to accelerate this module [27].

in terms of energy and latency due to the massive parameter size. With 1-bit weight quantization, the significant energy cost of parameter movement is alleviated, reducing the total energy cost by 4.6× and shifting the bottleneck entirely to full-precision input activation fetching in the attention layers. In conclusion, weight-only quantization helps to reduce both energy and latency. Aggressive weight-only quantization goes as far as to shift the bottleneck to the full-precision attention module.
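The saturation behaviour of weight-only quantization can be sketched with a first-order scaling argument. All constants below are hypothetical placeholders (the paper's actual per-access costs are in Table II); the point is only that a bit-width-independent cost term caps the total reduction well below the naive 32× and flips the bottleneck.

```python
# First-order sketch: how one decode step's energy rescales when weights
# shrink from 32 bits to 1 bit while activations stay full precision.
# All constants are hypothetical placeholders chosen for illustration.

N_PARAMS = 7e9      # Llama2-7B parameter count
PJ_PER_BIT = 10.0   # hypothetical off-chip access cost, pJ/bit
FIXED_PJ = 4.2e11   # hypothetical bit-width-independent energy
                    # (attention activations, outputs, MACs), pJ

def decode_step_energy_pj(weight_bits):
    """Return (weight-fetch energy, remaining energy) for one decode step."""
    return N_PARAMS * weight_bits * PJ_PER_BIT, FIXED_PJ

fetch32, rest = decode_step_energy_pj(32)
fetch1, _ = decode_step_energy_pj(1)
reduction = (fetch32 + rest) / (fetch1 + rest)

print(f"total-energy reduction: {reduction:.2f}x (below the naive 32x)")
print(f"bottleneck at 32 bits: {'weights' if fetch32 > rest else 'rest'}")
print(f"bottleneck at 1 bit:   {'weights' if fetch1 > rest else 'rest'}")
```

Under these assumed constants, weight fetching dominates at 32 bits but not at 1 bit, mirroring the bottleneck shift toward the full-precision attention module observed in the case study.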
In terms of energy, the performance of both accelerators is comparable, though the Edge Accelerator performs slightly worse. The MAC and low-level memory access energy is exactly the same, since this only depends on the problem size and the cost per bit of the MAC unit and register. There are two main differences in the energy profile: 1) the smaller PE array allows for less data reuse at the register level, incurring more SRAM accesses to replace the values in the registers; 2) since the SRAM is smaller, less data reuse is possible at this memory level, incurring more DRAM accesses.

In the Cloud Accelerator, the energy cost of the prefill stage is dominated by register accesses, indicating that the data reuse is optimal. In the Edge Accelerator, the energy cost of the SRAM component becomes non-negligible in the prefill stage due to the frequent replacement of the intermediate output values in the output register. Thus, compared to the Cloud Accelerator, the relative contribution of the prefill stage is larger in the Edge Accelerator, highlighting the importance of prefill-stage optimizations in edge scenarios.

V. CONCLUSION

This paper presents a novel open-source and easy-to-use framework that addresses the lack of tools for rapid modeling and energy cost evaluation of LLM workloads on hardware accelerators. It facilitates the identification of energy bottlenecks early in the design process. The framework integrates a generalized PyTorch implementation of LLMs together with the ZigZag simulation engine. By altering the workload graph for simulation and normalizing the results to the full model, we achieve significant reductions in simulation time at a negligible loss of accuracy. For the sake of demonstration, we identify energy inefficiencies in LLM inference for three common scenarios on both cloud and edge accelerators. Our framework is available open-source at github.com/KULeuven-MICAS/zigzag-llm.

VI. ACKNOWLEDGEMENTS

This project has been partly funded by the European Research Council (ERC) under grant agreement No. 101088865, the European Union's Horizon 2020 program under grant agreement No. 101070374, the Flanders AI Research Program and KU Leuven.

REFERENCES

[1] Q. Hu et al., "Characterization of large language model development in the datacenter," in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729.
[2] Y. Lin et al., "QServe: W4A8KV4 quantization and system co-design for efficient LLM serving," arXiv preprint arXiv:2405.04532, 2024.
[3] C. Kachris, "A survey on hardware accelerators for large language models," arXiv preprint arXiv:2401.09890, 2024.
[4] L. Mei et al., "ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators," IEEE Transactions on Computers, vol. 70, no. 8, pp. 1160–1174, 2021.
[5] Z. Yuan et al., "LLM inference unveiled: Survey and roofline model insights," 2024. [Online]. Available: https://arxiv.org/abs/2402.16363
[6] Q. Huang, "Mind the gap: Attainable data movement and operational intensity bounds for tensor algorithms," in International Symposium on Computer Architecture (ISCA), May 2024.
[7] H. Zhang et al., "A hardware evaluation framework for large language model inference," 2023. [Online]. Available: https://arxiv.org/abs/2312.03134
[8] M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[9] Y. Qin et al., "FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction," in Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023, pp. 1–14.
[10] M. Shi et al., "BitWave: Exploiting column-based bit-level sparsity for deep learning acceleration," in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 732–746.
[11] H. Fan et al., "Adaptable butterfly accelerator for attention-based NNs via hardware and algorithm co-design," in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 599–615.
[12] C. Fang et al., "An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 11, pp. 1573–1586, 2022.
[13] W. Shao et al., "OmniQuant: Omnidirectionally calibrated quantization for large language models," in The Twelfth International Conference on Learning Representations (ICLR), 2024.
[14] J. Lin et al., "AWQ: Activation-aware weight quantization for LLM compression and acceleration," in The Seventh Annual Conference on Machine Learning and Systems (MLSys), 2024.
[15] E. Frantar et al., "OPTQ: Accurate quantization for generative pre-trained transformers," in The Eleventh International Conference on Learning Representations (ICLR), 2023.
[16] S. Zeng et al., "FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs," in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024, pp. 223–234.
[17] C. Guo et al., "OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization," in Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023, pp. 1–15.
[18] J. Jang et al., "FIGNA: Integer unit-based accelerator design for FP-INT GEMM preserving numerical accuracy," in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 760–773.
[19] ONNX Community, "ONNX: Open Neural Network Exchange," https://github.com/onnx/, 2024.
[20] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019. [Online]. Available: https://arxiv.org/abs/1912.01703
[21] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," 2023. [Online]. Available: https://arxiv.org/abs/2307.09288
[22] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[23] N. Nayak et al., "FuseMax: Leveraging extended einsums to optimize attention accelerator design," in Machine Learning for Computer Architecture and Systems 2024, 2024. [Online]. Available: https://openreview.net/forum?id=HKwsTuKEpo
[24] T. Miyoshi et al., "FLAT: A GPU programming framework to provide embedded MPI," in Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5), 2012, pp. 20–29. [Online]. Available: https://doi.org/10.1145/2159430.2159433
[25] Google, "Coral AI," https://coral.ai/, 2020.
[26] OpenAI, "GPT-3: Language models are few-shot learners," https://arxiv.org/abs/2005.14165, 2020.
[27] T. J. Ham et al., "ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 692–705.
[28] S. Ma et al., "The era of 1-bit LLMs: All large language models are in 1.58 bits," 2024. [Online]. Available: https://arxiv.org/abs/2402.17764