ETRI Journal - 2024 - Park - NEST C A Deep Learning Compiler Framework For Heterogeneous Computing Systems With Artificial

The document introduces NEST-C, a deep learning compiler framework designed to optimize the performance of deep learning models on heterogeneous AI accelerators such as NPUs and PIMs. NEST-C addresses inefficiencies in existing frameworks by utilizing profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation integration, resulting in improved computational efficiency and adaptability. The framework enhances model deployment across various AI accelerators, achieving higher throughput, lower latency, and better resource utilization.

Received: 24 March 2024 Revised: 28 June 2024 Accepted: 13 August 2024

DOI: 10.4218/etrij.2024-0139

SPECIAL ISSUE

NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators

Jeman Park 1 | Misun Yu 1 | Jinse Kwon 1 | Junmo Park 2 | Jemin Lee 1 | Yongin Kwon 1

1 Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea
2 Samsung Electronics, Hwaseong, Republic of Korea

Correspondence
Jemin Lee and Yongin Kwon, Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea.
Email: [email protected] and [email protected]

Funding information
This study is supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (MSIT) (No. RS-2023-00277060, Development of OpenEdge AI SoC hardware and software platform).

Abstract
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency, but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.

KEYWORDS
AI accelerator, deep learning compiler, heterogeneous computing, model quantization, multi-level IR

1 | INTRODUCTION

In recent years, deep learning (DL) has revolutionized the field of artificial intelligence (AI), significantly impacting areas such as image recognition, natural language processing, and autonomous systems. DL frameworks such as PyTorch [1], ONNX [2], and TensorFlow [3] have significantly advanced the field by enabling the development and deployment of sophisticated neural networks.

However, these frameworks are primarily designed for general-purpose GPUs, which can lead to inefficiencies in specialized tasks. To address these inefficiencies, AI accelerators have been developed to maximize both throughput and energy efficiency compared to GPUs. Although AI accelerators are tailored for DL tasks, they

This is an Open Access article distributed under the term of Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition + Change Prohibition (http://www.kogl.or.kr/info/licenseTypeEn.do).
1225-6463/$ © 2024 ETRI

ETRI Journal. 2024;46(5):851–864. wileyonlinelibrary.com/journal/etrij 851

22337326, 2024, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.4218/etrij.2024-0139, Wiley Online Library on [27/05/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
852 PARK ET AL.

still rely on DL frameworks, and the gap between these DL frameworks and AI accelerators can hinder the development and deployment of neural networks in fields with various constraints.

Traditional DL frameworks provide abstract functionalities but often lack the detailed optimizations required by various AI accelerators. To solve this problem, DL compilers have been developed [4]. These compilers take DL models developed within DL frameworks as inputs, optimize them, and translate them for specific hardware to produce efficient executable code. The optimization process includes graph optimization, memory management, parallel processing, quantization, and execution tuning. DL models are efficiently optimized to run on a diverse range of hardware, from datacenter servers to mobile devices and embedded IoT sensors. Leading DL compilers such as tensor virtual machine (TVM) [5], Glow [6], and XLA [7] offer specialized features and techniques for optimization and deployment across a wide range of applications. However, these existing solutions often fail to optimize the performance of newer AI accelerators such as neural processing units (NPUs) [8–12] and processing-in-memory (PIM) [13, 14] devices.

The rapid advancement of AI accelerators such as NPUs and PIM devices presents significant challenges. Traditional compilers, primarily designed for CPUs and GPUs, struggle to maximize the performance and leverage the unique features of these new accelerators. This gap necessitates a compiler framework that can efficiently optimize DL models for these diverse AI accelerators.

To overcome these challenges, this paper proposes the NEST compiler (NEST-C) [15], an advanced DL compiler framework designed to simplify deployment and enhance the efficient execution of DL models. As an open-source project, NEST-C generates optimized code for various AI accelerators, including NPUs and PIM devices. Furthermore, NEST-C offers tuning features and tools tailored to the characteristics of each AI accelerator. The main contributions of NEST-C are as follows:

• To enable DL models for heterogeneous AI accelerators: NEST-C facilitates the use of DL models on a variety of AI accelerators, such as NPUs and PIMs, by supporting necessary adaptations for quantization and optimization. This ensures that models can efficiently operate across heterogeneous hardware platforms, broadening their applicability and performance.
• To optimize the DL model execution for multiple AI accelerators through graph partitioning: NEST-C utilizes graph partitioning techniques to efficiently distribute the workload of DL models. This is achieved by developing algorithms that enable dynamic task partitioning in mixed hardware systems, which include multiple AI accelerators, thereby maximizing computational efficiency.
• To enhance AI accelerator portability: By providing integration interfaces at each stage of the intermediate representation (IR) in the compilation process, NEST-C makes it easier to connect various hardware with compilers. This approach ensures that the compiler can be utilized more conveniently, facilitating seamless operation across different AI accelerators and enhancing the system’s overall flexibility and efficiency.

2 | BACKGROUND AND RELATED WORK

2.1 | Common structures of DL compilers

The general design architecture of a DL compiler is divided into front-end and back-end architectures to facilitate optimization and support for various hardware platforms [4]. Additionally, there is an abstract representation known as IR. IR is a crucial concept in compiler design that acts as a bridge between high-level programming languages and machine code. This simplifies the complex process of translating human-readable code into instructions that can be executed by hardware. LLVM IR [16] is among the most commonly used IRs for targeting CPUs and GPUs, owing to its versatility and extensive support within the LLVM compiler. However, because it is a low-level representation, LLVM IR is not inherently suited for optimization at the level of DL operators, which often requires more abstract and higher level transformations to achieve efficient execution on various hardware platforms. Consequently, DL compilers require higher level IRs that are better suited for expressing and optimizing DL models. To accommodate the complexity and diversity of DL models and hardware platforms, DL compilers may employ multiple levels of IR, each serving a different stage of the compilation process. The front end handles DL models from frameworks such as TensorFlow and PyTorch, performing optimizations such as operation fusion, the elimination of redundant operations, and memory access optimization to improve efficiency. It then abstracts the model’s structure and operations into standardized high-level IRs. In the front end, a high-level IR is used to optimize the relationships between operators and tensors in a hardware-independent manner. For example, TVM’s

relay IR [17] uses tensors and placeholders for data representation, thereby providing scalability for various operators. The back end uses low-level IRs, which contain more hardware specifications than the higher layers. Based on this information, it optimizes the code to account for hardware-specific characteristics, calling hardware-specific libraries where necessary to maximize execution efficiency. In addition, the back end translates the code into code that can run on actual hardware (e.g., CPUs, GPUs, NPUs, and PIMs). In the back end, a low-level IR is utilized for hardware-dependent optimization and code generation for the target hardware. For instance, TVM advances the approach based on Halide, and Glow optimizes tensor processing with command-based IR.

2.2 | Existing DL compilers

Google’s XLA enhances TensorFlow by providing hardware-agnostic and hardware-specific optimization through its compiler framework. It improves execution speeds and memory usage in DL models using techniques such as operator fusion and buffer analysis. XLA offers both just-in-time (JIT) and ahead-of-time (AOT) compilations, supporting diverse types of hardware such as CPUs and NVIDIA GPUs [18]. TVM, an open-source project, employs a multi-layered optimization strategy that separates computation from scheduling. It optimizes code across CPUs, GPUs, FPGAs, and ASICs using its unique IR systems, Relay IR and tensor IR. Meta’s Glow focuses on optimizing deep neural network models across various platforms using a two-stage IR process [19], prioritizing efficient execution and model portability. It supports CPUs and GPUs by optimizing memory usage and execution speed.

Accelerated Linear Algebra (XLA) is developed by Google for TensorFlow. It uses a compiler framework that performs hardware-agnostic high-level and hardware-specific low-level optimizations. The high-level optimizer (HLO) IR is used for graph-level optimizations, and LLVM is used for code generation. XLA optimizes TensorFlow graphs using techniques such as operator fusion, common subexpression elimination, and buffer analysis, resulting in improved execution speeds and memory usage for deep neural networks (DNNs).

TVM is an open-source platform that introduces a multi-layered optimization approach. This approach separates computation from scheduling and is inspired by Halide. TVM’s optimization process involves graph optimization, operator-level optimization, and automatic tuning, facilitated by a machine learning-based system to identify the optimal schedule among billions of possibilities. This approach enables efficient code generation for various targets including CPUs, GPUs, FPGAs, and ASICs. TVM’s IR system, consisting of Relay and TIR, abstracts model operations and structures at different levels. This enables precise optimizations and code generation tailored to the specific characteristics of the hardware.

Graph Lowering (Glow), developed by Facebook, aims to optimize DNN models for efficient execution across different hardware platforms using a two-stage IR process. The optimization involves high-level graph optimizations followed by hardware-specific optimizations through node lowering, focusing on minimizing memory consumption and maximizing execution speed. Glow’s IR enables high-level graph optimizations and decomposes high-level operators into lower level linear algebra nodes, enabling efficient execution on CPUs, GPUs, and ASICs. This process prioritizes model portability while optimizing performance and uses an “in-memory form” lower level IR for hardware-dependent optimizations and memory latency hiding.

All existing compilers were primarily designed for CPUs and GPUs, making it challenging to adapt them for the newly emerging NPUs and PIMs. The arrival of complex and varied AI accelerators such as NPUs and PIMs has significantly increased the difficulty of optimizing edge devices that incorporate these accelerators. Moreover, these compilers do not account for the complexities involved in the simultaneous optimization and leveraging of the parallel capabilities of various AI accelerators. Distinct differences exist between the current compilers (TVM, Glow, and XLA) and NEST-C, as detailed in Table 1. NEST-C includes features in each optimization category that are not supported by the existing compilers and, notably, supports a wider range of NPUs.

3 | NEST-C

3.1 | NEST-C overview

This study devised NEST-C, a DL compiler that supports various edge-specific AI accelerators and provides optimizations including graph partitioning, quantization, and execution tuning. NEST-C supports traditional processing units such as CPUs and GPUs while generating optimized code for AI accelerators such

as NPUs and PIMs. Its architecture is divided into three main components: front end, middle end, and back end, as shown in Figure 1.

FIGURE 1 Architecture of the NEST-C ecosystem.

TABLE 1 Comparison of optimization features.

| Optimization feature | TVM | Glow | XLA | NEST-C |
| Quantization | Calibration-based Int4, Int8, and Fp16 | Profiling-based Int8, Int16, and Fp16 | Profiling-based Int8, Int16, and Fp16 | Profiling-based Int8, Int16, and Fp16; layer-wise mixed precision |
| Graph partitioning | Memory-based static partitioning | Static partitioning | Static and dynamic partitioning | Profiling-based dynamic partitioning |
| NPU back-end support | VTA (open architecture) | Habana (closed architecture) | Google TPU | EVTA, AimFuture NMP, OpenEdge Enlight, and SK Hynix GDDR6-AiM |
| Auto-tune of tile size | Execution time profiling, uniform tile size, and static tile scheduling | Execution time profiling | Auto-tuning and hardware-specific optimization | Hardware utilization, profiling based, nonuniform tile size, and dynamic tile scheduling |
| Layer fusing and execution | Static layer fusing and layer-by-layer execution | Static layer fusing and layer-by-layer execution | Operator fusion for optimized execution | Dynamic layer fusing and dynamic layer execution |

NN deployment on heterogeneous accelerators on a device: O
NN deployment on multiple devices: O O

Initially, the front end converts the input from a DL model, defined within a framework such as ONNX or PyTorch, into a graph IR. The NEST-C front end, which is fundamentally based on Glow’s IR structure, executes basic optimizations (e.g., dead code elimination and transpose-node optimizations), which are also applied in Glow’s graph IR.

Secondly, NEST-C uniquely supports the characteristics of resource-constrained edge AI accelerators through its middle end, distinguishing it from traditional DL compilers. The middle end performs hardware-independent optimization and graph partitioning. After receiving the graph IR from the front end, it subjects this IR to post-training quantization, layer fusion, and operator scheduling. These processes, which are adaptable to various AI accelerators, are executed before forwarding an IR graph to the graph partitioner. Graph partitioning then divides the computational graph of the model into subgraphs, each allocated to the processing unit that best matches its computational capacity, memory, and data-transfer speed requirements. This strategy optimally distributes model layers across hardware, enhancing execution efficiency and facilitating parallel processing in edge environments.

Finally, in the back end, various hardware-dependent optimizations such as execution tuning, latency hiding, and memory allocation are applied to the partitioned subgraphs considering the characteristics of various edge AI

accelerators. NEST-C employs a tensor IR to perform hardware-dependent optimizations for each operator separately. This approach allows the fine-tuning of the performance of DL models on AI accelerators by optimizing memory usage, computational efficiency, and execution flow based on the unique characteristics and capabilities of each target device. The code generator creates executable code by utilizing a variety of back-end code generators, including the “C code generator” (CCodeGen), “LLVM IR generator,” “relay IR generator,” and “NPU code generator.” Each code generator produces code optimized for specific processing units, which are eventually converted into executable files using compilers such as GCC, LLVM, TVM relay, and those provided by each back end.

3.2 | NEST-C middle end

3.2.1 | Post-training quantization and optimal configuration procedures in NEST-C

Quantization converts a neural network’s FP32 precision to INT8, thereby reducing its memory footprint and accelerating inference. This technique maps floating-point values to a narrower integer range while maintaining the performance with minimal loss in accuracy. This enables the models to run efficiently on devices with limited computational resources.

In NEST-C, as shown in Algorithm 1, to achieve optimal quantization in terms of accuracy and latency, a variety of configurations are supported at the compiler level. Because quantization must be supported at the compiler level by default, the post-training quantization (PTQ) approach is followed. Similar to conventional PTQ methods, NEST-C involves a profiling stage for calibration before quantizing activations. Subsequently, based on the accuracy and latency requirements, configurations can be selectively created for the clipping, granularity, and mixed-precision schemes. Algorithm 1 is implemented at the compiler level. Therefore, it takes $F$ as input, which is the graph IR of the DL model generated during the compilation phase. The output of Algorithm 1 is an IR $F^{*}$ that includes the quantization information.

Quantization can reduce accuracy. To address this issue, the proposed algorithm sequentially applies schemes such as clipping, granularity, and mixed precision. If none of the configurations maintains the desired accuracy, it is considered that PTQ methods alone cannot solve this problem, and an error is output. In such cases, retraining the quantized model using external retraining tools is necessary.

Detailed explanations of each configuration selected by Algorithm 1 are provided below.

Scheme: To enhance code generation efficiency across various hardware platforms, we emphasize uniform integer quantization, which incorporates four linear mapping techniques: asymmetric, symmetric, symmetric with UINT8, and symmetric power2. The notations used in these equations are as follows. $Q_{i8}$ represents the quantized 8-bit integer value. $V_{f32}$ denotes the floating-point 32-bit value. $S$ is the scale factor. $Z$ indicates the zero point. $V_{\max}$ and $V_{\min}$ denote the maximum and minimum values, respectively. The bit width is denoted by $n$. Asymmetric (affine mapping) converts the float ranges to $[-2^{n-1},\, 2^{n-1} - 1]$, optimally utilizing the INT8 capacity. The quantization and dequantization processes are defined by

$$Q_{i8} = \mathrm{ROUND}\!\left(\frac{V_{f32}}{S} + Z\right), \quad S = \frac{V_{\max} - V_{\min}}{2^{n} - 1}, \quad Z = \mathrm{ROUND}\!\left(-\frac{V_{\min}}{S} - 2^{n-1}\right), \quad V_{f32} = (Q_{i8} - Z)\, S. \tag{1}$$

Symmetric maps real zeros to quantized zeros without converting the FP32 range min and max, using the absolute maximum for setting qmin and qmax. Its quantization and dequantization are

$$Q_{i8} = \mathrm{ROUND}\!\left(\frac{V_{f32}}{S}\right), \quad S = \frac{\mathrm{MAX}(\mathrm{ABS}(V_{f32}))}{2^{(n-1)} - 1}, \quad V_{f32} = S\, Q_{i8}. \tag{2}$$

The symmetric with UINT8 scheme blends asymmetric and symmetric methods, adapting to real-

value distributions and toggling between symmetric ($Z = 0$) and asymmetric ($Z = -128$) quantization as follows:

$$Q_{i8} = \mathrm{ROUND}\!\left(\frac{V_{f32}}{S} + Z\right), \quad S = \frac{\mathrm{MAX}(\mathrm{ABS}(V_{f32}))}{2^{n} - 1}, \quad Z = \begin{cases} -128, & \text{if } V_{\min} \ge 0, \\ 0, & \text{otherwise,} \end{cases} \quad V_{f32} = (Q_{i8} - Z)\, S. \tag{3}$$

The symmetric scheme with a power-of-two scale simplifies the hardware design by using bit-shift operations instead of multiplications and is defined by

$$S = 2^{\left\lceil \log_{2} \frac{\mathrm{MAX}(\mathrm{ABS}(V_{f32}))}{2^{(n-1)} - 1} \right\rceil}. \tag{4}$$

Clipping: To mitigate accuracy loss without retraining, the Glow compiler employs clipping to minimize the Kullback–Leibler divergence between the floating-point and quantized distributions, addressing the impact of outliers in weight and activation distributions.

Granularity: The choice between tensor-wise and channel-wise quantization granularity balances accuracy and latency, with fine granularity increasing computational demand, particularly in convolutions with diverse weight values.

Mixed precision: NEST-C supports mixed-precision quantization by maintaining the first and last layers at their original precision (FP32) for execution. The first and last layers are known to be the most sensitive to quantization [20] and are prioritized for quick mixed-precision decisions. For additional mixed-precision applications, developers must determine the layers that should be quantized.

3.2.2 | Hardware-independent optimizations

Hardware-agnostic optimization focuses on the structural and algorithmic characteristics of DL models. This approach performs optimizations that can generally enhance performance across all types of hardware. In addition to the PTQ techniques previously mentioned (Section 3.2.1), the key hardware-agnostic optimization techniques applied in NEST-C include layer fusion and operator scheduling.

Layer fusion: Fusion, such as combining convolution with batch normalization and ReLU activation, serves to increase computational efficiency and minimize memory usage. The convolution operation, denoted by $y = w * x + b$, when fused with batch normalization, transforms into $y' = \gamma\,[(w * x + b - \mu)/\sqrt{\sigma^{2} + \epsilon}\,] + \beta$, streamlining the computation by merging the standardization directly into the convolution process. Similarly, fusing convolution with ReLU activation simplifies the operation to $y' = \max(0,\, w * x + b)$ by applying nonlinearity immediately after convolution, thereby reducing computational steps and memory overhead. Layer fusion, implemented as an optimization pass at the compiler level, streamlines neural network operations by reducing computational steps and memory usage, enhancing efficiency and speed, especially in memory-restricted environments.

Operator scheduling: From the perspective of IR scheduling in neural network computation, the scheduling of operators is crucial for optimizing performance. Transformations, such as in-place buffer modifications for element-wise arithmetic, are key examples of such optimizations. In this context, instructions manipulate global variables or locally allocated buffers, with each operand annotated using the qualifiers in, out, or inout. These qualifiers indicate whether a buffer is read from (in), written to (out), or both (inout), indicating to the optimizer when optimizations such as copy elimination or buffer sharing are possible.

3.2.3 | Graph partitioner (PartitionTuner)

Typically, edge AI accelerators support a limited range of DL operations, necessitating the partitioned execution of DL models. Frameworks such as XLA and TVM provide code generation for accelerators but are limited to supporting only a single accelerator. To address this issue, NEST-C integrates PartitionTuner [21], which distributes the operations of a DL model across multiple accelerators and synchronizes the processing of their operations to enhance the overall performance. Figure 2 illustrates the architecture of this approach.

FIGURE 2 Workflow of the PartitionTuner in NEST-C.

Initially, the Branch Extractor analyzes the structure of the input graph to identify branches that require sequential execution. Subsequently, the Graph Partitioner segments the graph into groups of operations for each branch. The Profiler then generates a machine

code for each back end identified by the Back-End inputs from memory are partially stored, with outputs
Finder and profiles the execution times for each group accumulated as the computations proceed. This necessi-
of operations. Based on these profiling results, a user- tates diverse scheduling strategies such as input station-
modifiable Partition Policy file is automatically gener- ary, weight stationary, and output stationary [22].
ated. The Partition Scheduler creates partitions based “Tiling” involves storing and processing inputs and out-
on the Partition Policy using the back-end information puts in the buffer sequentially; an NPU requires all
assigned to the operations. It then develops a Schedul- inputs for a tile to be loaded and space for outputs to be
ing Plan that specifies the execution sequence among available before processing begins.
the partitions. Subsequently, based on the Scheduling The period an NPU waits to begin operation is called
Plan, the Partition Tuner generates Scheduling Code for the “memory latency,” and strategies such as adjusting
each back end. This approach accurately identifies the the size of tiles and employing double buffering are used
fastest optimal partition for each accelerator, ensuring to reduce this time, a practice known as “memory latency
the optimal utilization of various accelerators within hiding.” NEST-C provides three main optimization
heterogeneous computing systems. Additionally, the approaches to maximize memory latency hiding and
overall execution speed can be significantly improved determine the optimal tile size and scheduling.
by employing independently operable accelerators in Empirical discovery through device profiling: This involves
parallel. experimenting on actual devices to find optimal settings.
However, an exhaustive search of all the possible config-
urations can be time-consuming or inefficient.
3.3 | NEST-C back end Auto-tuning using machine learning: To mitigate the chal-
lenges of exhaustive search, NEST-C employs auto-tuning
3.3.1 | Tensor IR and hardware-dependent techniques that leverage machine learning to streamline
optimization the optimization process, which significantly reduces the
search space.
During graph partitioning, the hardware on which the partitioned graph will execute is determined, and based on this specific hardware, the IR is lowered to the tensor-IR level. At this level, the memory size and location of the inputs and outputs used by each operator are determined. This phase aims to find the optimal execution code that minimizes memory usage while improving execution performance. NEST-C can dynamically allocate the main memory with the help of the operating system depending on the target hardware. However, this method introduces overhead for allocation and deallocation. Furthermore, if the DL tensor data are distributed across various memory areas, it could degrade the performance of the NPU’s direct memory access.

Therefore, NEST-C calculates the total memory size required to execute the operations of a specific partition from beginning to end. Subsequently, it allocates that amount of memory, preferably as a contiguous block, just once. It then devises a strategy to statically reuse memory within that allocated space without further assistance from the operating system. This approach reduces memory fragmentation and allocation overhead, ensuring efficient memory management and optimized performance for the DL computations on the targeted hardware.

NPUs have small internal buffers so that they can deliver data faster than DRAM; however, they cannot store all inputs and outputs simultaneously. Instead, each

Utilization of performance modeling or performance-accurate simulators: This method involves using performance models or simulations to search all cases. If a model or simulator is implemented precisely, the optimal execution method can be identified quickly.

3.3.2 | Code generator

The code generator creates code tailored to the target hardware based on the optimized tensor IR. Depending on the target hardware, it may need to generate binary machine code or code that conforms to the device driver or inference library interface of the target hardware. Alternatively, the tensor IR could be lowered to an LLVM IR via an LLVM code generator.

It is necessary to implement a unique code generator for each target hardware, and NEST-C provides references that can be consulted when developing code generators for new hardware. Additionally, NEST-C offers functionality that not only generates code for several popular models (e.g., ImageNet classification) but also automatically creates code for input preprocessing and output presentation. The code generator further includes various back-end generators, such as C, NPU, and EVTA code generators. The following section details the integration of the code generator with diverse AI accelerators across multi-level IR.
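The single-allocation, static-reuse strategy described above for tensor-IR memory planning can be illustrated with a small sketch. This is not NEST-C's actual planner, which the text does not specify: the `Tensor` record, the `plan_arena` function, the greedy first-fit policy, and the tensor sizes are all assumptions chosen for illustration. The idea is only that tensors with disjoint lifetimes may share bytes inside one block that is allocated once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int        # bytes
    first_use: int   # index of the operator that produces the tensor
    last_use: int    # index of the last operator that reads it


def plan_arena(tensors):
    """Assign each tensor a fixed offset inside one contiguous arena.

    Hypothetical greedy first-fit: tensors whose lifetimes overlap never
    share bytes; tensors with disjoint lifetimes may reuse a region.
    Returns (offsets, arena_size).
    """
    placed = []   # (offset, tensor) pairs already assigned
    offsets = {}
    # Placing large tensors first tends to keep the arena small.
    for t in sorted(tensors, key=lambda t: -t.size):
        # Byte ranges held by tensors that are live at the same time as t.
        busy = sorted(
            (off, off + p.size)
            for off, p in placed
            if not (p.last_use < t.first_use or t.last_use < p.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # t fits in the gap before this range
            offset = max(offset, end)
        offsets[t.name] = offset
        placed.append((offset, t))
    arena_size = max((off + t.size for off, t in placed), default=0)
    return offsets, arena_size


# Hypothetical chain of operators: in -> a -> b -> out.
tensors = [
    Tensor("in", 64, 0, 1),
    Tensor("a", 128, 1, 2),
    Tensor("b", 64, 2, 3),
    Tensor("out", 32, 3, 4),
]
offsets, arena_size = plan_arena(tensors)
print(offsets, arena_size)
```

In this toy case the arena needs 192 bytes, whereas giving every tensor its own buffer would need 288 bytes; the block is requested from the operating system once and all offsets are fixed at compile time.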
4 | DL COMPILER IMPLEMENTATION AND OPTIMIZATION FOR DIVERSE TARGET HARDWARE

NEST-C enables the export of DL models to an LLVM IR, thereby supporting a wide range of general-purpose processors that LLVM supports, including prominent CPU architectures from Intel and ARM. However, NEST-C primarily targets accelerators (NPUs) specifically developed for DL computations, and the compiler implementation can be categorized into three forms based on the hardware characteristics and interface levels provided by these NPUs, as illustrated in Figure 1.

Using only the front end: This approach generates code directly from the initially created graph IR without any optimization process, as in C code generation (Section 4.1).

Utilizing the middle end: This approach performs optimizations such as PTQ and layer fusion, generates a partitioned graph, and then uses the target hardware’s development toolkit for the remaining compilation process, as with OpenEdge Enlight [9] (Section 4.2).

Employing the full stack of NEST-C: This approach generates executable final code for the target hardware directly from the optimized tensor IR that has undergone all of NEST-C’s processes and optimizations (e.g., AiM Future’s NMP [8]).

The development and verification of the full NEST-C stack were initially performed using EVTA (Section 4.4), which served as a reference NPU. AiM Future NMP (Section 4.3) undergoes a compilation process similar to that of EVTA. OpenEdge Enlight (Section 4.2) implements the compiler up to the middle end of NEST-C. Additionally, implementations utilizing NEST-C include SK Hynix’s GDDR6-AiM [13] and TVM’s Relay, showcasing the flexibility and adaptability of NEST-C to a broad spectrum of computational platforms for DL.

4.1 | CCodeGen: C code generation

CCodeGen automatically generates C/C++ code compatible with common cross-compilers, such as GCC and LLVM, directly from the output of the DL framework. This is important when deploying DL models on heterogeneous computing systems. In heterogeneous computing systems with varying requirements, developers must manually compile, translate, and deploy DL models into machine code or back-end inputs for each hardware type. This process places a heavy burden on developers. Therefore, CCodeGen generates C/C++ code that can be used by most computing systems, making it easier for developers to support heterogeneous devices. CCodeGen parses the input DL model to generate a dataflow graph (DFG), which encompasses the operations, input/output information, and parameters of the DL model. Next, using the DFG, it generates “inference.c” and “inference.h,” C code files that contain inference functions, and “weight.bin,” a binary file that contains the trained weights used for inference. Figure 3 illustrates an example of the operation library. The API is defined using C++ template functions to ensure that the operation functions are not dependent on the variable types (such as float, INT16, and INT8). Additionally, it allows users to select the type of compiler provided by the device (GCC or LLVM), automatically handling any differences in syntax between compilers. Moreover, it includes the option to decide on the use of different libraries, such as OpenBLAS.

FIGURE 3 API and code example of the operation library.

4.2 | Graph-IR implementation

Chip vendors provide their own back-end compilers for commercial-grade AI semiconductors. Therefore, to reduce the development effort and generate high-quality final target code, it is necessary to integrate NEST-C with compilers provided by chip vendors. NEST-C offers ONNX conversion functionality for integration with private back-end compilers at the graph-IR level. As explained previously, hardware-independent optimizations are performed at the graph-IR level. Then, the graph IR is serialized and stored according to the YAML meta format. After normalizing the stored meta file to align with the ONNX format, an ONNX model suitable for input to the private compiler provided by commercial-grade AI semiconductor manufacturers is generated. As shown in Figure 1, for the conversion from graph IR to ONNX, NEST-C includes four functional modules: Partition to YAML, ONNX IR Legalizer, Sanity Checker with ONNXRuntime, and Partial ONNX Exporter.

Partition to YAML: This module divides the input model into partitions of the desired size. A partition can
represent either an entire model or a specific group of consecutive operators. The generated YAML file includes information regarding each partition and the names of the operators used in the IR graph. This information is utilized for code generation tailored to the partitioning or integration with specific back-end compilers.

ONNX IR Legalizer: This module performs normalization to resolve representation differences between the graph IR and the ONNX IR. This step is necessary for converting the graph IR into ONNX IR while ensuring consistency in the representation.

Sanity Checker with ONNXRuntime: To verify that the model converted to ONNX functions correctly, this module uses ONNXRuntime, an inference engine that runs on general-purpose CPUs, to validate the model’s results.

Partial ONNX Exporter: This module converts and stores the YAML meta file in the ONNX format. This step converts the meta file, after hardware-independent optimization, to the final ONNX format.

Using this structure, NEST-C partitions and normalizes the input model before generating the final ONNX format model. This model undergoes hardware-independent optimization and is integrated with the Enlight compiler, enabling hardware-accelerated inference. This approach also allows for the integration of a third-party private compiler.

4.3 | Tensor-IR implementation

The neuromorphic processor (NMP) is an embedded evaluation board developed by LG Electronics centered on a novel architecture designed to facilitate efficient DL operations. The foundational principle of the NMP’s design involves leveraging RISC-V instruction set architecture extensions to create specialized instructions for various CNN components, including convolutional layers, fully connected layers, pooling layers, and element-wise operations. The architecture of the NMP features a multicore NPU comprising several processors. Each processor houses multiple processing units, each equipped with a RISC-V core, a multiply-accumulate unit, and memory buffers to support the computational demands of neural network processing.

Quantization for NMP: NMP is a hardware accelerator that supports only integers in Q-format. Therefore, the NEST-C middle end for NMP must ensure that all DL models are quantized into Q-format. For Q-format quantization, the minimum and maximum values of all tensors holding floating-point values are calculated, and NEST-C determines the most suitable format, either INT8 or INT16, as supported by NMP. This process allows the precise representation of floating-point numbers within the fixed-point integer constraints of the NMP, thereby optimizing both the accuracy and efficiency of DL computations.

Back-end optimization for NMP: Because each processing unit in NMP possesses its own buffer and accelerator, operations must be well tiled and distributed, similar to EVTA. Whereas EVTA determines the optimal tile size and number of buffers by profiling the execution performance on real devices considering only output stationary scheduling, NMP considers input stationary, weight stationary, and output stationary scheduling. It uses a performance model to exhaustively search for and identify the best performance across all scenarios. This comprehensive approach enables NMP to optimize the resource allocation and processing efficiency for diverse computational patterns in DL operations.

4.4 | EVTA

EVTA is a custom NPU based on the VTA [23] from the Apache TVM project. It modifies VTA’s compute module to include an MOV instruction, reducing power usage and improving performance by eliminating DRAM data transfers between the output and input buffers. Moreover, EVTA supports diverse operations (INT8, FP16, FP32, and binary) and allows multiple NPUs to work together and share DRAM resources, as illustrated in Figure 4. EVTA’s IR is extended within the structure of NEST-C’s graph IR and tensor IR to meet the hardware characteristics and optimization requirements.

FIGURE 4 EVTA architecture and configuration.

Graph-IR and middle-end optimization: At the middle end of NEST-C, efforts are made to minimize DRAM access and maximize hardware utilization by changing the data layout according to the DL operators and performing layer fusion optimizations. The compute module of EVTA can process ReLU operations in conjunction with matrix multiplication; hence, ReLU is fused with convolution and fully connected operators. Because EVTA can
only perform matrix multiplication, the middle-end stage transforms a four-dimensional input into a six-dimensional one, converting the last two dimensions into a matrix size that EVTA can process in one cycle. Additionally, EVTA’s graph IR was implemented to represent such a six-dimensional data layout. Following this, graph partitioning is performed according to the configuration of multi-EVTA, and the graph is then lowered to tensor IR.

Tensor IR and back-end optimization: In the NEST-C back end, as the graph IR is lowered to tensor IR, the execution order of each operator and the DRAM storage locations for the operator’s input and output data are determined. For EVTA, the inputs must be tiled because the buffer size is limited. Double buffering is utilized to maximize memory latency hiding. The extent of latency hiding varies with the tile size and number of buffers, and NEST-C employs profiling and auto-tuning to determine the optimal tile size and number of buffers.

Codegen: NEST-C provides the EVTA execution library along with the device driver, featuring an interface in the format of Figure 5. NEST-C generates the code according to the execution library interface. For convolution operations, the arguments also include optimized tile sizes and the number of buffers. The generated code is designed to allow the device driver to parse the arguments at runtime and immediately execute the code.

FIGURE 5 EVTA execution library interface for convolution.

NEST-C is tasked with handling a large number of tensor data and deep layers. The code generated by NEST-C computes each layer, and the results are temporarily stored in DRAM. To ensure that the EVTA developed using NEST-C is error free, it is necessary to validate the computational results with respect to the expected values. For this purpose, NEST-C supports a debugging mode that allows the temporary results stored in DRAM for each layer to be written to a file. The results of the operations may differ slightly, depending on the hardware and data type. NEST-C accommodates these differences by allowing the tolerance level to be set.

5 | EXPERIMENTS AND EVALUATION

The performance of AI accelerators and heterogeneous systems using NEST-C was evaluated in three experiments. First, we conducted an EVTA with PartitionTuner experiment to demonstrate the reduction in latency and improved computational efficiency when using both CPU and NPU resources with NEST-C. Next, to verify the portability of NEST-C, we tested it with two types of commercial AI accelerators: AiM Future NMP for the tensor-IR interface and OpenEdge Enlight for the graph-IR interface. These experiments encompassed a variety of DL models trained on ImageNet [24], including GoogLeNet [25], ResNeXt50 [26], ResNet50 [27], ResNet18 [27], MNIST [28], LeNet [29], MobileNetV2 [24], and SqueezeNet [30]. Considering the diversity in the types of operations supported by each AI accelerator, distinct DL models were chosen for each experiment.

5.1 | EVTA with PartitionTuner

The Xilinx ZCU102 platform was used to evaluate the performance of employing EVTA in NEST-C. The ZCU102 provides a quad-core ARM Cortex-A53 CPU and an FPGA. Four EVTAs were ported to the FPGA on the Xilinx ZCU102 platform to run at 333 MHz. CCodeGen was used to generate the machine code for the CPU. The model was partitioned between the CPU and NPUs using PartitionTuner, which also performed the quantization for the EVTAs. Table 2 shows a significant reduction in latency when NPUs are used in conjunction with the CPU instead of using the CPU alone. However, increasing the number of NPUs does not significantly reduce latency. This is because the quantization and data dimension transformation operations required to use NPUs are executed on the CPU, which is much slower than the NPUs. The parallel execution of these operations by two NPUs and the CPU limits fully parallel processing. Therefore, to improve the performance of multiple NPUs, it is important to partition and schedule the model.

TABLE 2 Latency (ms) of the DL models on multi-EVTA.

| Model | Baseline: CPU | NEST-C (EVTA): CPU + NPU | NEST-C (EVTA): CPU + 2NPUs |
|---|---|---|---|
| ResNet18 | 1262.45 | 247.41 | 125.09 |
| ResNet50 | 3180.52 | 654.33 | 580.38 |
| ResNeXt50 | 6717.05 | 1645.90 | 756.45 |
| GoogLeNet | 1455.75 | 497.52 | 287.94 |
| SqueezeNet | 297.56 | 175.98 | 97.91 |
| AlexNet | 884.60 | 789.10 | - |
| EfficientNet | 6178.70 | 2820.72 | - |
| ZFNet512 | 1673.95 | 1662.54 | - |
| MNasNet | 1258.35 | 866.04 | - |
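To make the latency comparison in Table 2 easier to read, the following minimal sketch converts the reported latencies into speedup ratios. The latencies are copied from Table 2 for the five models that have a two-NPU measurement; the dictionary name `table2` and the helper `speedups` are hypothetical names introduced only for this illustration.

```python
# Latencies in ms copied from Table 2: (CPU, CPU + NPU, CPU + 2NPUs).
table2 = {
    "ResNet18": (1262.45, 247.41, 125.09),
    "ResNet50": (3180.52, 654.33, 580.38),
    "ResNeXt50": (6717.05, 1645.90, 756.45),
    "GoogLeNet": (1455.75, 497.52, 287.94),
    "SqueezeNet": (297.56, 175.98, 97.91),
}


def speedups(cpu, one_npu, two_npus):
    """Return (CPU vs. 1 NPU, CPU vs. 2 NPUs, 1 NPU vs. 2 NPUs) ratios."""
    return cpu / one_npu, cpu / two_npus, one_npu / two_npus


for model, row in table2.items():
    s1, s2, gain = speedups(*row)
    print(f"{model:<11} {s1:6.2f}x {s2:6.2f}x {gain:6.2f}x")
```

The third ratio isolates the benefit of adding the second NPU, which is the quantity the discussion of CPU-side quantization and layout transformation overhead refers to; it varies noticeably from model to model.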
5.2 | Tensor-IR interface evaluation

To assess the performance of AI accelerators at the tensor-IR level in NEST-C, the NMP [22] board provided by AiM Future was utilized. Various DL models were evaluated, including MNIST, LeNet, ResNet18, ResNet50, SqueezeNet, and Inception. Performance evaluations for each DL model within NEST-C were conducted and compared with the performance data from XLA provided by NMP. The model accuracy provided by Google TensorFlow was used as baseline. Table 3 shows that most of the reference models exhibited results similar to the baseline accuracy, suggesting that NEST-C was effectively implemented at the tensor-IR level within NMP. Unlike NMP’s XLA, which is optimized directly by the hardware manufacturer, NEST-C automatically performs optimization at the compiler level through hardware profiling. Despite this, the latency performance of NEST-C was very close to that of NMP’s XLA, indicating its ability to maintain optimal performance even when new AI accelerators were integrated.

TABLE 3 Accuracy and latency of the DL models on NMP.

| Model | NEST-C (NMP): Top-1 accuracy | NEST-C (NMP): Latency (ms) | XLA (NMP): Top-1 accuracy | XLA (NMP): Latency (ms) | TensorFlow: Top-1 accuracy |
|---|---|---|---|---|---|
| MNIST | 98.2 | 0.163 | - | - | 98.90 |
| LeNet | 94.8 | 0.635 | 99.9 | 0.199 | 94.80 |
| ResNet18 | 66.8 | 22.477 | - | - | 69.93 |
| ResNet50 | 71.3 | 67.385 | 70.7 | 55.927 | 74.93 |
| SqueezeNet | 48.7 | 14.269 | 47.1 | 12.504 | 49.00 |
| MobileNet | 70.2 | 70.077 | 70.9 | 14.03 | 71.80 |
| Inception | 74.2 | 30.59 | 76.9 | 72.686 | 77.90 |

5.3 | Graph-IR interface evaluation

The proposed common ONNX-based interface was used to experimentally validate the successful linkage between the general AI compiler and the private NPU compiler. For a comparative evaluation, the results generated by integrating Enlight and NEST-C were compared with the performance results of ResNet50 and MobileNetV2 DL models on general hardware CPUs and GPUs supported by the existing NEST-C.

OpenEdge Enlight supports various layer types, including convolution layers with kernel sizes ranging from 1 × 1 to 7 × 7 and strides from 1 to 4. The depth-wise convolution layer supports a 3 × 3 kernel with strides ranging from 1 to 2. The supported activation functions include Bypass, ReLU, Leaky ReLU, Sigmoid, Tanh, and Mish. Additionally, pooling operations are supported with configurations of 4:1 (2 × 2) and stride 2, which are available in max/average and global average pooling modes.

The experimental results, as shown in Figure 6A, indicate a decrease in accuracy when models are executed on OpenEdge Enlight. This decrease was due to the quantization process, which reduced the model precision from 32 to 8 bits. Despite this reduction, the models maintained acceptable performance.

In terms of the inference latency, as depicted in Figure 6B, the integration of NEST-C and OpenEdge Enlight achieved impressive inference times of 9 ms for MobileNetV2 and 107 ms for ResNet50. This performance is significantly faster than the inference times on an Intel i7 CPU, which are 59.9 ms for MobileNetV2 and 136.9 ms for ResNet50. It is also comparable to the results on an NVIDIA 2080ti GPU, which are 15.35 ms for MobileNetV2 and 14.8 ms for ResNet50.

FIGURE 6 (A) Accuracy and (B) latency of the models on the targets (Intel i7-8700 CPU, 2080ti GPU, and U280 FPGA-based NPU).

5.4 | Discussion and limitations

The experimental results demonstrated the effectiveness and versatility of the NEST-C framework across various
AI accelerators, showing significant improvements in latency reduction and computational efficiency. The portability of NEST-C was validated through tests using AiM Future NMP and OpenEdge Enlight, which maintained optimal performance across different hardware platforms. However, owing to the constraints of using commercial hardware, we were limited to experimenting with DL models provided by hardware manufacturers, which restricted the range of models tested. Additionally, there is a potential accuracy loss owing to quantization from 32 to 8 bits, and further evaluation is needed to understand the scalability of NEST-C with large models and datasets. Addressing these limitations in future studies will enhance the robustness and efficiency of NEST-C for diverse AI applications.

6 | CONCLUSIONS

This study developed NEST-C, a novel DL compiler framework designed to improve the deployment and performance of DL models across various AI accelerators. The experimental results demonstrate that NEST-C significantly enhances computational efficiency and adaptability, achieving higher throughput and lower latency through profiling-based quantization, dynamic graph partitioning, and multi-level IR integration. Although testing was limited to the DL models provided by hardware manufacturers owing to commercial hardware constraints, NEST-C maintained its optimal performance across different hardware platforms and demonstrated its potential for broader applicability.

Future work will focus on addressing the current limitations by extending the range of tested models, minimizing accuracy loss due to quantization, and evaluating the scalability of NEST-C with larger models and datasets. By overcoming these challenges, we aim to further enhance the robustness and efficiency of NEST-C, ensuring its adaptability to a wider array of hardware platforms and DL tasks. Additionally, NEST-C was primarily designed to optimize DNNs used for image recognition. Therefore, we plan to extend its capabilities to support large language models and transformer architectures to suit the evolving DL research environments.

CONFLICT OF INTEREST STATEMENT

The authors declare that there are no conflicts of interest.

ORCID

Jeman Park https://orcid.org/0009-0002-9524-0738
Misun Yu https://orcid.org/0000-0001-7319-1053
Jinse Kwon https://orcid.org/0000-0003-3091-9926
Junmo Park https://orcid.org/0000-0002-8500-8874
Jemin Lee https://orcid.org/0000-0002-9332-3508
Yongin Kwon https://orcid.org/0000-0003-2973-246X

REFERENCES

1. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: an imperative style, high-performance deep learning library, (Proc. 33rd Int. Conf. Neural Inf. Process. Syst., Vol. 32, Curran Associates Inc., Red Hook, NY, USA), 2019.
2. ONNX Contributors, Open Neural Network Exchange (ONNX), 2024. https://github.com/onnx/onnx. Accessed: 2024-03-18.
3. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, and M. Isard, TensorFlow: a system for large-scale machine learning, (12th USENIX Symp. Operating Syst. Des. Implementation (OSDI’16), Savannah, GA, USA), 2016, pp. 265–283.
4. M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, The deep learning compiler: a comprehensive survey, IEEE Trans. Parallel Distrib. Syst. 32 (2020), no. 3, 708–727.
5. T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, and L. Ceze, TVM: an automated end-to-end optimizing compiler for deep learning, (13th USENIX Symp. Operating Syst. Des. Implementation (OSDI’18), Carlsbad, CA, USA), 2018, pp. 578–594.
6. N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov, N. Gibson, J. Hegeman, M. Lele, and R. Levenstein, Glow: graph lowering compiler techniques for neural networks, arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1805.00907
7. C. Leary and T. Wang, XLA: TensorFlow, compiled, 2017. TensorFlow Dev Summit.
8. AiM Future, The future of artificial intelligence: AiM future’s product lineup, 2023. https://aimfuture.ai. Accessed: 2024-03-22.
9. OPENEDGES, Neural processing unit (NPU) IP—ENLIGHT, 2022. https://www.openedges.com/npu. Accessed: 2024-03-22.
10. J.-W. Jang, S. Lee, D. Kim, H. Park, A. S. Ardestani, Y. Choi, C. Kim, Y. Kim, H. Yu, H. Abdel-Aziz, J.-S. Park, H. Lee, D. Lee, M. W. Kim, H. Jung, H. Nam, D. Lim, S. Lee, J.-H. Song, S. Kwon, J. Hassoun, S. Lim, and C. Choi, Sparsity-aware and re-configurable NPU architecture for Samsung Flagship Mobile SoC, (ACM/IEEE 48th Annu. Int. Symp. Comput. Archit., Valencia, Spain), 2021, pp. 15–28.
11. Qualcomm, Unlocking on-device generative AI with an NPU and heterogeneous computing, 2024. https://www.qualcomm.com. Accessed: 2024-03-22.
12. Apple, Deploying transformers on the Apple Neural Engine, 2023. https://machinelearning.apple.com/research/deploying-transformers-on-the-apple-neural-engine. Accessed: 2024-03-22.
13. Y. Kwon, K. Vladimir, N. Kim, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, and B. An, System architecture and
software stack for GDDR6-AiM, (IEEE Hot Chips 34 Symp., Cupertino, CA, USA), 2022, pp. 1–25.
14. Samsung, HBM-PIM: cutting-edge memory technology to accelerate next-generation AI, 2023. https://semiconductor.samsung.com/. Accessed: 2024-03-18.
15. ETRI, NEST-C. https://gitlab.com/ones-ai/nest-compiler. Accessed: 2024-03-22.
16. C. Lattner and V. Adve, LLVM: a compilation framework for lifelong program analysis & transformation, (Int. Symp. Code Gener. Optim., San Jose, CA, USA), 2004, pp. 75–86.
17. J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, Relay: a new IR for machine learning frameworks, (Proc. 2nd ACM SIGPLAN Int. Workshop Mach. Learn. Program. Lang., Association for Computing Machinery, Philadelphia, PA, USA), 2018, pp. 58–68.
18. J. Dean, Machine learning for systems and systems for machine learning, Presentation at 2017 Conf. Neural Inf. Process. Syst., Curran Associates, Long Beach, CA, USA, 2017.
19. Meta, Glow’s Graph IR optimization. https://github.com/pytorch/glow/blob/master/docs/Optimizations.md. Accessed: 2024-02-22.
20. J. Lee, M. Yu, Y. Kwon, and T. Kim, Quantune: Post-training quantization of convolutional neural networks using extreme gradient boosting for fast deployment, Future Gener. Comput. Syst. 132 (2022), 124–135.
21. M. Yu, Y. Kwon, J. Lee, J. Park, J. Park, and T. Kim, PartitionTuner: an operator scheduler for deep-learning compilers supporting multiple heterogeneous processing units, ETRI J. 45 (2023), no. 2, 318–328.
22. R. Sousa, M. Pereira, Y. Kwon, T. Kim, N. Jung, C. S. Kim, M. Frank, and G. Araujo, Tensor slicing and optimization for multicore NPUs, J. Parallel Distrib. Comput. 175 (2023), 66–79.
23. T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, A hardware-software blueprint for flexible deep learning specialization, IEEE Micro 39 (2019), no. 5, 8–16.
24. J. Deng, W. Dong, and R. Socher, ImageNet: a large-scale hierarchical image database, (IEEE Conf. Comput. Vision Pattern Recognit., Miami, FL, USA), 2009, pp. 248–255.
25. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, (Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Boston, MA, USA), 2015, pp. 1–9.
26. S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, (IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Honolulu, HI, USA), 2017, pp. 1492–1500.
27. K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, (IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Las Vegas, NV, USA), 2016, pp. 770–778.
28. L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Process. Mag. 29 (2012), no. 6, 141–142.
29. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998), no. 11, 2278–2324.
30. F. N. Iandola, S. Han, and M. W. Moskewicz, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1602.07360

AUTHOR BIOGRAPHIES

Jeman Park received his BS, MS, and PhD degrees in Electronics and Computer Engineering from Hanyang University, Republic of Korea, in 2004, 2006, and 2014, respectively. He is a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. His research interests include computer networks, edge computing, and deep learning compilers.

Misun Yu received the MS degree from the Department of Computer Science and Engineering at Pohang University of Science and Technology, Republic of Korea. She is a principal researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Her main research interests include concurrent program analysis, software testing, deep learning, and embedded systems.

Jinse Kwon received MS and PhD degrees in Computer Science and Engineering from Chungnam National University, Daejeon, Republic of Korea, in 2017 and 2024, respectively. He is currently a researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. His research interests include deep learning compilers and on-device computing.

Junmo Park received the BS degree in Computer Science from Kwangwoon University, Seoul, Republic of Korea, in 2012 and the MS degree from the Graduate School of Convergence Science and Technology at Seoul National University, Republic of Korea, in 2020. He joined Samsung Electronics in Hwaseong, Republic of Korea, in 2012, where he has been involved in compiler optimization and development. Since 2020, he has been working as a Principal Software Engineer on mobile GPU compilers. His research interests include deep learning, compilers, embedded systems, HW/SW co-design, and optimization.
Jemin Lee received his BS and PhD degrees in Computer Science and Engineering from Chungnam National University, Daejeon, Republic of Korea, in 2011 and 2017, respectively. He is currently a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Since 2023, he has also served as an assistant professor in the AI Department at the University of Science and Technology, Daejeon, Republic of Korea. He was a postdoctoral researcher at the Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, from 2017 to 2018. His research interests include energy-aware mobile computing and deep learning compilers.

Yongin Kwon received the BSc degree in Electrical and Electronic Engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, in 2008, and MS and PhD degrees in Electrical and Computer Engineering from Seoul National University, Republic of Korea, in 2010 and 2015, respectively. From 2015 to 2019, he worked for Samsung Electronics as a Staff Software Engineer. Since 2019, he has been with the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea, where he is currently a senior researcher. His research interests include neural processing units, compilers, deep learning, and embedded systems.

How to cite this article: J. Park, M. Yu, J. Kwon, J. Park, J. Lee, and Y. Kwon, NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators, ETRI Journal 46 (2024), 851–864, DOI 10.4218/etrij.2024-0139
