
Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators

2024 IEEE 37th International System-on-Chip Conference (SOCC) | 979-8-3503-7756-9/24/$31.00 ©2024 IEEE | DOI: 10.1109/SOCC62300.2024.10737844

Robin Geens†, Man Shi†, Arne Symons†, Chao Fang†,‡, Marian Verhelst†
†MICAS, KU Leuven, Belgium
‡School of Electronic Science and Engineering, Nanjing University, China
[email protected]

Abstract—The rise of Large Language Models (LLMs) has significantly escalated the demand for efficient LLM inference, primarily fulfilled through cloud-based GPU computing. This approach, while effective, is associated with high energy consumption, resulting in large operating expenses and considerable carbon footprints. In the meantime, growing privacy concerns advocate for inference on edge devices, which are constrained by a limited battery capacity. Both cloud and edge scenarios necessitate energy-efficient LLM inference strategies.

This paper addresses the urgent need for energy-efficient inference by proposing an open-source framework designed to model LLM workloads on dedicated accelerators. Our framework facilitates early identification of energy bottlenecks through rapid modeling of the execution efficiency of a wide range of LLMs on diverse hardware architectures. Key innovations include a PyTorch-based generalized LLM template to easily generate custom workload graphs, extensions of the ZigZag design space exploration framework, and techniques to significantly speed up simulation time at a negligible loss of accuracy. Using a representative hardware architecture, we conduct three case studies that reveal critical energy bottlenecks in Llama2-7B inference: 1) memory-bound computing in the decode stage is detrimental not only for the latency, but also for the energy cost; 2) aggressive weight-only quantization can reduce the energy cost by 4.6× and shift the bottleneck from weight fetching to the attention mechanism; 3) in edge scenarios, the relative energy cost of the prefill stage is more significant, encouraging efforts to optimize both the prefill and decode stages. Our framework is available open-source at github.com/KULeuven-MICAS/zigzag-llm.

I. INTRODUCTION

The rapid advancement of generative AI, particularly Large Language Models (LLMs), has significantly increased the demand for LLM inference. This growing demand is predominantly met through cloud-based GPU computing [1]. While this approach supports low-latency applications, it comes with high energy consumption. Furthermore, privacy concerns are prompting LLM inference on edge devices, which also face stringent energy constraints due to limited battery capacity [2]. These trends emphasize the need for energy-efficient LLM inference in the cloud and at the edge.

To improve efficiency, LLM workloads are increasingly executed on dedicated hardware accelerators [3]. However, the design of these accelerators is challenging, as the energy consumption bottlenecks vary strongly as a function of workload characteristics [4], leading to different optimal accelerator architecture configurations. It is hence important to conduct rapid modeling and evaluation of LLM workloads on a variety of hardware accelerator topologies to locate the energy bottlenecks during the early design stages, and to facilitate the discovery of optimal energy-efficient design points.

Several modeling tools have been developed to enable rapid early-stage analysis and help understand the impact of hardware parameters on the performance of LLM inference [5]–[7]. Despite their utility towards latency assessments, these tools lack quantitative energy estimations. As such, the challenge of identifying energy-efficient design points on dedicated accelerators remains unaddressed. To address these limitations, we propose an open-source and easy-to-use framework that aims at exposing the energy bottlenecks of LLM inference on custom hardware architectures.

In summary, the main contributions of this paper are as follows:
1) We develop a user-friendly framework to model LLM inference energy and latency on user-defined hardware accelerator architectures.
2) We introduce techniques to save simulation time by reducing the workload graph and the required number of simulations, and adjusting the results afterwards.
3) We deploy the framework to uncover often-overlooked energy bottlenecks in the inference of popular LLMs, providing insights for efficiency improvements.

Fig. 1. Diagram of a single layer of a decoder-only transformer.

II. BACKGROUND

A. Large Language Models

Key Components. Figure 1 illustrates the fundamental computationally-intensive components of transformer-based

LLMs. These components, repeated in multiple layers, include linear projection, self-attention, and feed-forward networks (FFNs). In concrete terms, linear projection maps input embeddings to query (Q), key (K), and value (V) representations using learned weights. Self-attention then focuses on scoring based on Q-K similarities. With multi-head attention, different Q, K, and V pairs are distributed across multiple heads that compute attention scores in parallel. The following FFN adopts two linear transformations with a non-linear activation in between, processing embeddings independently to capture complex patterns by mixing information from different heads.

Prefill and decode stage. The inference of decoder-only transformer-based LLMs consists of two phases: prefill and decode. In the prefill stage, many tokens are provided as input and processed simultaneously. In the decode phase, all the following tokens are generated one-by-one. Each time, the newly generated token is appended to the previous input and the compound sequence is presented as an input to the model. The different computational properties of the two stages thus present a distinct set of challenges and opportunities for optimization of LLM inference across diverse hardware platforms.

KV-Caching. Key-Value (KV) caching [8] is often used to optimize LLM inference by reducing the computational redundancy associated with the decode phase. It stores the key and value projections of previous input tokens, allowing them to be reused for new tokens. However, the unique dataflow related to KV-caching on dedicated accelerators remains unexplored.

B. Dedicated Accelerators for LLMs

Many research efforts [9]–[12] focus on the design of dedicated accelerators for transformer networks, using energy or latency as performance metric. With the increasing complexity of LLMs, various quantization techniques [13]–[15] and dedicated accelerators [16]–[18] have been proposed to improve system-level efficiency and processing time. As LLM inference becomes more prevalent both in the cloud [1] and at the edge [2], the optimal design points for dedicated LLM accelerators vary widely, underscoring the urgent need for early-stage design space exploration through modeling.

C. LLM Accelerator Modelling

Several modeling tools have been developed to enable early-stage rapid analysis and help understand the impact of hardware parameters on the performance of LLM inference. Notably, LLM-Viewer [5] is a roofline model-based framework that allows for a quick back-of-the-envelope calculation of the inference latency, based on the hardware's peak performance and memory bandwidth. Additionally, it calculates the (higher-level) memory consumption based on the model's size and has some notion of quantization. NVIDIA's Orojenesis [6] goes one step further and also incorporates the on-chip buffer size into the latency model. While these two are architecture-agnostic and only consider key hardware parameters, LLMCompass [7] allows for a detailed, multi-core hardware definition in order to estimate the latency of LLM inference at high accuracy.

Fig. 2. Overview of the framework.

Fig. 3. Chosen sequence length and representative token for prefill and decode phase simulations.

While these frameworks are effective in helping the user understand the impact of hardware changes on the inference latency, they all lack quantitative energy results. Their key differences are summarized in Table I.

TABLE I
COMPARISON OF LLM INFERENCE MODELLING FRAMEWORKS

Framework        Simulation speed   Latency accuracy   Energy accuracy   Hardware details
LLM-Viewer [5]   µs                 −                  0                 −−
Orojenesis [6]   ms                 +                  0                 −
LLMCompass [7]   s                  ++                 0                 ++
This work        s                  ++                 ++                ++

III. THE PROPOSED FRAMEWORK

An overview of the framework is shown in Figure 2. The main components are the automatic workload generation (Sec. III-A), modules to eliminate repeated layers (Sec. III-B) and to efficiently model KV-caching (Sec. III-C), and the extended simulation engine (Sec. III-D). The user inputs consist of the selection of the LLM, the quantization scheme and the hardware definition of the accelerator (Sec. III-E).
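The KV-cache decode shortcut mentioned above (detailed in Sec. III-C) can be previewed with a short sketch. The cost function below is a toy stand-in that merely grows linearly with the KV-cache length; it is not the framework's actual ZigZag simulation, and the numbers are illustrative only. It shows why simulating a single token halfway through the decode sequence and scaling by the token count closely tracks a full token-by-token sweep:

```python
def decode_cost_exact(cost_fn, prefill_len, total_len):
    """Reference: simulate every decode step; step i sees a KV-cache of length i."""
    return sum(cost_fn(kv_len) for kv_len in range(prefill_len, total_len))

def decode_cost_approx(cost_fn, prefill_len, total_len):
    """Single-simulation shortcut: simulate only the token halfway through
    the decode sequence and scale by the number of decode tokens."""
    n_decode = total_len - prefill_len
    midpoint_kv = prefill_len + n_decode // 2  # the (3L/4)-th token when prefill = L/2
    return n_decode * cost_fn(midpoint_kv)

# Toy stand-in cost model: a fixed weight-fetch term plus an attention term
# linear in the KV-cache length (illustrative values, arbitrary units).
toy_cost = lambda kv_len: 5_000.0 + 2.0 * kv_len

L = 4096                                       # context window; prefill = decode = L/2
exact = decode_cost_exact(toy_cost, L // 2, L)
approx = decode_cost_approx(toy_cost, L // 2, L)
rel_err = abs(approx - exact) / exact          # small: over- and underestimates cancel
```

Because the toy cost is linear in the KV-cache length, the midpoint token is almost exactly the average-cost token, mirroring the paper's reported 1.0 × 10⁻⁵ relative error at a 2048× simulation speedup.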

A. Workload Graph Generation

Our simulation engine accepts workloads in Open Neural Network Exchange (ONNX) [19] format. ONNX representations are available for many workloads as it is a widely used standard. In order to have full control of the internals of the LLMs to be modeled, the framework houses an LLM implementation programmed in PyTorch [20] with an automatic conversion to ONNX. The implementation can easily be parameterized to correctly represent many popular LLM models. Moreover, this enables us to force the use of KV-caching directly into the workload graph, which would be impossible when using the available ONNX graphs. In this case, the input sequence length is forced to 1 and the key and value caches are treated as inputs of the network graph (of a chosen size, see Section III-C). A separate ONNX graph is generated for the prefill and decode stage, as shown in Figure 2. Weight and activation quantization is supported by manually attaching a custom attribute to each layer in the ONNX graph.

B. Elimination of Repeated Layers

Transformer architectures exhibit a high level of regularity. Though the LLM may have a large number of parameters, the corresponding transformer architecture repeats the exact same layer many times (but with different weights), and each layer consists of many identical heads. To greatly save simulation time, we parameterize the in-house PyTorch model with only a single layer and a single head before exporting it to ONNX. The result is a trivial ONNX graph that contains all operations, but only for a single layer and head. In accordance with Figure 1, the query, key, value and output projections are each combined for all heads and represented as a single matrix multiplication. Afterwards, the simulation results are multiplied by the correct number of layers and heads to obtain the inference results for the full model. This technique drastically speeds up simulation time while maintaining accuracy. Furthermore, it ensures that the simulation time is only marginally dependent on the model size. For reference, the hardware performance simulations of both the OPT-125M and the GPT3-175B model take around 20 seconds on a medium-sized server.

C. Modelling Decode Phase with KV-caching

The decode phase of LLM inference is auto-regressive, meaning that all tokens after the prefill stage are generated one-by-one. Simulating the full decode process can be achieved by running the simulation for each generated token and only providing a single token as input in each run, which is a cumbersome and time-intensive process.

Instead, we approximate the energy and latency results of the full decode phase by simulating the generation of a single token somewhere along the decode sequence, and multiplying the found energy and latency results by the number of tokens. This comes at the risk of over-generalizing the workload's temporal mapping: generating the first token (when the KV-cache is filled by the prefill phase only) might result in a different memory access pattern than the last token (when the KV-cache is full). Since the number of operations of the attention operation is linear in the sequence length, we select the token halfway through the decode sequence as the candidate for simulation. This way, the overestimation of energy and latency for the tokens earlier in the decode stage can even out the underestimation of the tokens that come after.

Furthermore, to model a full LLM inference for a realistic scenario, we assume that half of the model's sequence length consists of input tokens that are processed in the prefill stage, and the other half are output tokens generated in the decode stage. This common scenario occurs, for example, when the LLM is used to rewrite a piece of text: the prefill stage is used to load the input text, and the altered text is generated in the decode stage.

Thus, to simulate the decode phase, we construct the scenario where the KV-cache is filled up by the previous (3L/4 − 1) tokens, and the (3L/4)-th token is provided as input in order to generate the (3L/4 + 1)-th, where L is the model's context window size. This principle is demonstrated in Figure 3. The results of this single simulation are then multiplied by L/2 to represent the full decode process.

We validate this approach by simulating the full auto-regressive decode phase of Llama2-7B [21], token by token. The sum of the inference energy of each consecutive token is compared to the result from the proposed approach. Our approach exhibits a relative error of 1.0 × 10⁻⁵ and speeds up the decode simulation by a factor of 2048.

D. Simulation Engine

The simulation itself is performed by ZigZag [4], which we extend with support for LLM workloads and quantization. ZigZag is capable of finding the optimal spatial and temporal unrolling for the given workload and architecture, in terms of the chosen criterion (energy, delay, or energy-delay product). One important remark is that while non-linear layers (e.g. SoftMax, GeLU and LayerNorm) are part of the ONNX graph, ZigZag skips them during simulation. However, since the computational complexity of LLM inference is dominated by matrix multiplications, this simplification is expected to have minimal impact on the results. Moreover, hardware accelerators often have built-in support for the non-linear layers such that they can be computed in parallel and at a low energy cost.

E. Hardware Architecture Definition

In order to model LLM inference on hardware accelerators and derive practically usable conclusions and insights, the hardware model should be realistic and representative of real-life use cases. For energy-efficient inference, accurately modelling the memory hierarchy and the corresponding access energy costs is especially important.

As such, our work employs the ZigZag hardware architecture model. This model incorporates: 1) the compute core, in the form of a processing element (PE) array with a flexible shape and local registers of variable size; 2) a hierarchy of memories; 3) the interconnection pattern between the PEs and the memory hierarchy. At each level, the memory hierarchy has a variable capacity, bandwidth, number of read/write ports,

and targeted operands. An example of a resulting accelerator architecture is shown in Figure 4.

Fig. 4. Hardware diagram of the target platform to model LLM inference.

IV. CASE STUDIES

To demonstrate the practical applicability of the framework, we conduct three case studies and extract hardware-oriented insights from each of them. In all cases, we use Llama2-7B [21] as the workload. This openly available and medium-sized LLM is often used as a representative LLM workload in literature. For the sake of reducing the design space, we fix the dataflow to the commonly used weight-stationary dataflow for both the prefill and decode stages. The batch size is set to 8 to mirror a practical setting, while the resulting energy and latency values are normalized to a single batch. Furthermore, the energy-delay product (EDP) is adopted as the optimization criterion in all experiments, enabling the identification of design points that offer a balanced trade-off between latency and energy.

To visualize the results, the modeled layers with similar compute behaviour are grouped and merged into a single category. We define three groups: linear projection, for all query, key, value and output projections; attention, for the Q×K^T and S×V layers; and feed-forward network (FFN), for the two layers of the fully-connected networks after each set of heads and the gate layer in Llama models. The three groups are also shown in Figure 1.

TABLE II
ENERGY COSTS ASSUMED FOR THE GENERIC HARDWARE ARCHITECTURE

Operation                        Energy cost* [pJ]
Off-chip DRAM read/write (1b)    25
On-chip SRAM read/write (1b)     1
Register read/write (1b)         0.025
MAC operation (32b)              1.5
*Derived for a 45nm process with a 0.9V supply [22].

A. Hardware Setup

The case studies use a generalized hardware architecture [23] that is representative of state-of-the-art accelerators for both cloud and edge scenarios. The cloud and edge accelerators employ the same architecture template, with specifications in terms of compute capacity (PE array size), memory capacity, and memory bandwidth, as shown in Figure 4. The specifications for the cloud accelerator align with those used in a previous study [24], while the edge specifications are matched to the resources of the Google Edge TPU [25].

Table II lists the energy costs of the multiply-accumulate (MAC) operation along with the different memory read and write costs utilized in the case studies. As vendors have strict NDAs for different technologies, we approximate the energy costs based on a 45nm process [22]. These energy costs can easily be customized by the user as framework inputs. All case studies assume the same MAC energy cost, regardless of datatype.

B. Energy and Latency Analysis on Cloud Accelerator

As an initial demonstration, we analyze the energy and latency breakdown of a full inference (prefill and decode) of Llama2-7B [21] on the Cloud Accelerator. The precision of the weights, activations and intermediate output values is set to 32 bits (e.g. FP32). The results are shown in Figure 5.

From the latency breakdown, we observe that the prefill stage achieves nearly ideal computational latency with negligible memory stalls. Its efficiency stems from operating in a compute-bound regime, consistent with roofline model analysis [5]. This underscores the prefill stage's ability to leverage extensive parallelism, which is inherently facilitated by cloud accelerators.

During the decode stage, the accelerator operates in the memory-bound region and the latency is dominated by memory stalls, i.e. the time that the PE array cannot be fully utilized because the data still needs to be transferred from memory. The low arithmetic intensity stems from the unusual compute pattern during the auto-regressive decode stage with KV-caching: for the next-token prediction, only a single, new token is provided as input, yet all model weights need to be fetched from memory.

The results clearly show that operating in the memory-bound region not only penalizes the inference latency, but also greatly increases the inference energy cost, since more expensive, higher-level memory accesses are incurred. As such, the main energy and latency bottleneck in this case is the fetching of high-precision weights from higher-level memories.

Fig. 5. The energy cost and latency performance of Llama2-7B on the Cloud Accelerator of Figure 4.

Fig. 6. Impact of different quantization schemes on the energy and the latency cost of Llama2-7B inference on the Cloud Accelerator of Figure 4.
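The scale of the weight-fetching bottleneck, and why weight-only quantization attacks it, can be approximated with back-of-the-envelope arithmetic from the Table II costs. This sketch deliberately ignores on-chip reuse and all non-weight traffic, so it only illustrates the trend the framework quantifies; the 7×10⁹ parameter count is the nominal size of Llama2-7B:

```python
DRAM_PJ_PER_BIT = 25.0  # off-chip DRAM read/write cost from Table II (45 nm)

def weight_fetch_energy_j(n_params, bits_per_weight):
    """Energy to stream every weight from DRAM once, as the memory-bound
    decode stage does for each generated token (on-chip reuse ignored)."""
    return n_params * bits_per_weight * DRAM_PJ_PER_BIT * 1e-12  # pJ -> J

N_PARAMS = 7e9                                 # nominal Llama2-7B parameter count
e_fp32 = weight_fetch_energy_j(N_PARAMS, 32)   # W32A32 baseline
e_w4 = weight_fetch_energy_j(N_PARAMS, 4)      # W4A16 weight-only quantization
```

The 8× gap from the weight bits alone is an upper bound on the quantization benefit; the measured 4.6× total reduction is smaller because the attention layers, whose activations remain high-precision, become the new bottleneck.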

Another observation is that the attention module is rather inconsequential in comparison to the weight-based layers. This effect is even more pronounced in larger, commercially used models such as GPT3-175B [26]. In larger models, the problem size of the weight-based layers significantly increases, whereas the sequence length, which determines the complexity of the attention layers, stays in the same order of magnitude. This observation differs from other works in literature that suggest the attention module dominates the inference cost due to its quadratic complexity as a function of the sequence length, prompting special hardware-enabled techniques to accelerate this module [27].

C. Impact of Quantization Scheme

A common quantization technique for LLMs involves quantizing weights to lower bits while maintaining activations at high precision. This addresses the high memory access cost for weights observed in Section IV-B. In this section, we compare two weight-only quantization schemes, W4A16 [13] and W1A32 [28], to exhibit the energy cost across different precision levels. In W4A16, the intermediate output values are assumed to be full-precision 32 bits. W1A32 is chosen to demonstrate the benefits of extreme weight quantization.

We do not scale the number of bits that are processed in a single MAC unit. In other words, each PE performs at most a single MAC operation each cycle, regardless of data type.

Lowering the number of weight bits has a double effect: on one hand, it increases the number of weights that are transferred per memory access, decreasing the cost per weight. On the other hand, it increases the number of weights that can be stored in a given memory, increasing the data locality and reuse at that level. In other words, quantizing the weights reduces both the memory access cost per weight and the number of accesses.

The results are shown in Figure 6. Both the W4A16 and W1A32 schemes significantly reduce energy consumption and latency in the linear projection and FFN layers compared to the W32A32 baseline (Figure 5). For W4A16 quantization, the energy results for the attention layers are slightly better, since the input and final output activations use a lower precision. Despite the 8-fold reduction in weight precision, the LLM inference remains bottlenecked by higher-level weight fetching in terms of energy and latency, due to the massive parameter size. With 1-bit weight quantization, the significant energy cost of parameter movements is alleviated, reducing the total energy cost by 4.6× and shifting the bottleneck entirely to full-precision input activation fetching in the attention layers.

In conclusion, weight-only quantization helps to reduce both energy and latency. Aggressive weight-only quantization goes as far as to shift the bottleneck to the full-precision attention module.

Fig. 7. The energy cost and latency performance of the Llama2-7B model quantized to W4A16 on the Cloud Accelerator of Figure 4.

D. Comparison with Edge Accelerator

Finally, we compare the Cloud Accelerator (used in Sec. IV-B and IV-C) to the Edge Accelerator (defined in Sec. IV-A). Since we found weight-only quantization to be beneficial, we evaluate the Llama2-7B model with W4A16 quantization. The results are shown in Figure 7.

For the Edge Accelerator, the compute part of the latency breakdown is 64 times larger, since it has 64 times fewer PEs. A second key difference in terms of compute-based latency is that the Edge Accelerator operates at a clock frequency that is about 4 times slower, which is not shown in latency results expressed in number of cycles. For the memory part of the latency breakdown, there is no obvious, quantitative relationship between architecture parameters and latency. Though the Edge Accelerator has a much lower on-chip and off-chip bandwidth, less bandwidth is required by the much smaller PE array. On the other hand, since the PE array is 64 times smaller, less data can be reused at the register and SRAM level, requiring more memory accesses and causing more memory stalls.
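The compute-bound part of this comparison follows directly from the PE count and clock frequency. In the sketch below, the 64× PE ratio and roughly 4× clock ratio come from the text above, but the absolute PE count, clock frequencies, and MAC count are illustrative assumptions, not the specifications of Figure 4:

```python
def compute_cycles(macs, n_pes):
    """Ideal compute-bound cycle count: one MAC per PE per cycle, full utilization."""
    return macs / n_pes

def compute_latency_s(macs, n_pes, clock_hz):
    """Wall-clock latency of the compute part (memory stalls not modeled)."""
    return compute_cycles(macs, n_pes) / clock_hz

MACS = 1e12                                  # illustrative per-inference workload size
CLOUD_PES, CLOUD_HZ = 65536, 1e9             # assumed cloud configuration
EDGE_PES, EDGE_HZ = CLOUD_PES // 64, 250e6   # 64x fewer PEs, ~4x slower clock

cycle_ratio = compute_cycles(MACS, EDGE_PES) / compute_cycles(MACS, CLOUD_PES)
time_ratio = compute_latency_s(MACS, EDGE_PES, EDGE_HZ) / \
             compute_latency_s(MACS, CLOUD_PES, CLOUD_HZ)
```

In cycle counts the gap is 64×; in wall-clock time the slower clock compounds it to 256× under these assumptions, which is why reporting latency in cycles isolates the PE-array effect from the clock difference.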

In terms of energy, the performance of both accelerators is comparable, though the Edge Accelerator performs slightly worse. The MAC and low-level memory access energy is exactly the same, since this only depends on the problem size and the cost per bit for the MAC unit and register. There are two main differences in the energy profile: 1) the smaller PE array allows for less data reuse at the register level, incurring more SRAM memory accesses to replace the values in the registers; 2) since the SRAM is smaller, less data reuse is possible at this memory level, incurring more DRAM accesses.

In the Cloud Accelerator, the energy cost of the prefill stage is dominated by register accesses, indicating that the data reuse is optimal. In the Edge Accelerator, the energy cost of the SRAM component becomes non-negligible in the prefill stage due to the frequent replacement of the intermediate output value in the output register. Thus, compared to the Cloud Accelerator, the relative contribution of the prefill stage is larger in the Edge Accelerator, highlighting the importance of prefill stage optimizations in edge scenarios.

V. CONCLUSION

This paper presents a novel open-source and easy-to-use framework that addresses the lack of tools for rapid modeling and energy cost evaluation of LLM workloads on hardware accelerators. It facilitates the identification of energy bottlenecks early in the design process. The framework integrates a generalized PyTorch implementation of LLMs together with the ZigZag simulation engine. By altering the workload graph for simulation and normalizing the results to the full model, we achieve significant reductions in simulation time at a negligible loss of accuracy. For the sake of demonstration, we identify energy inefficiencies in LLM inference for three common scenarios for both cloud and edge accelerators. Our framework is available open-source at github.com/KULeuven-MICAS/zigzag-llm.

VI. ACKNOWLEDGEMENTS

This project has been partly funded by the European Research Council (ERC) under grant agreement No. 101088865, the European Union's Horizon 2020 program under grant agreement No. 101070374, the Flanders AI Research Program and KU Leuven.

REFERENCES

[1] Q. Hu et al., "Characterization of large language model development in the datacenter," in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729.
[2] Y. Lin et al., "QServe: W4A8KV4 quantization and system co-design for efficient LLM serving," arXiv preprint arXiv:2405.04532, 2024.
[3] C. Kachris, "A survey on hardware accelerators for large language models," arXiv preprint arXiv:2401.09890, 2024.
[4] L. Mei et al., "ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators," IEEE Transactions on Computers, vol. 70, no. 8, pp. 1160–1174, 2021.
[5] Z. Yuan et al., "LLM inference unveiled: Survey and roofline model insights," 2024. [Online]. Available: https://arxiv.org/abs/2402.16363
[6] Q. Huang, "Mind the gap: Attainable data movement and operational intensity bounds for tensor algorithms," in International Symposium on Computer Architecture (ISCA), May 2024.
[7] H. Zhang et al., "A hardware evaluation framework for large language model inference," 2023. [Online]. Available: https://arxiv.org/abs/2312.03134
[8] M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[9] Y. Qin et al., "FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14.
[10] M. Shi et al., "BitWave: Exploiting column-based bit-level sparsity for deep learning acceleration," in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 732–746.
[11] H. Fan et al., "Adaptable butterfly accelerator for attention-based NNs via hardware and algorithm co-design," in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 599–615.
[12] C. Fang et al., "An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 11, pp. 1573–1586, 2022.
[13] W. Shao et al., "OmniQuant: Omnidirectionally calibrated quantization for large language models," in The Twelfth International Conference on Learning Representations (ICLR), 2024.
[14] J. Lin et al., "AWQ: Activation-aware weight quantization for LLM compression and acceleration," in The Seventh Annual Conference on Machine Learning and Systems (MLSys), 2024.
[15] E. Frantar et al., "OPTQ: Accurate quantization for generative pre-trained transformers," in The Eleventh International Conference on Learning Representations (ICLR), 2023.
[16] S. Zeng et al., "FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs," in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024, pp. 223–234.
[17] C. Guo et al., "OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15.
[18] J. Jang et al., "FIGNA: Integer unit-based accelerator design for FP-INT GEMM preserving numerical accuracy," in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 760–773.
[19] ONNX Community, "ONNX: Open Neural Network Exchange," https://github.com/onnx/, 2024.
[20] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019. [Online]. Available: https://arxiv.org/abs/1912.01703
[21] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," 2023. [Online]. Available: https://arxiv.org/abs/2307.09288
[22] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[23] N. Nayak et al., "FuseMax: Leveraging extended einsums to optimize attention accelerator design," in Machine Learning for Computer Architecture and Systems 2024, 2024. [Online]. Available: https://openreview.net/forum?id=HKwsTuKEpo
[24] T. Miyoshi et al., "FLAT: A GPU programming framework to provide embedded MPI," in Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). New York, NY, USA: Association for Computing Machinery, 2012, pp. 20–29. [Online]. Available: https://doi.org/10.1145/2159430.2159433
[25] Google, "Coral AI," https://coral.ai/, 2020.
[26] OpenAI, "GPT-3: Language models are few-shot learners," https://arxiv.org/abs/2005.14165, 2020.
[27] T. J. Ham et al., "ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 692–705.
[28] S. Ma et al., "The era of 1-bit LLMs: All large language models are in 1.58 bits," 2024. [Online]. Available: https://arxiv.org/abs/2402.17764

