ETRI Journal - 2024 - Park - NEST C A Deep Learning Compiler Framework For Heterogeneous Computing Systems With Artificial
DOI: 10.4218/etrij.2024-0139
SPECIAL ISSUE
1 Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea
2 Samsung Electronics, Hwaseong, Republic of Korea

Correspondence
Jemin Lee and Yongin Kwon, Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea.
Email: [email protected] and [email protected]

Funding information
This study is supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (MSIT) (No. RS-2023-00277060, Development of OpenEdge AI SoC hardware and software platform).

Abstract
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.

KEYWORDS
AI accelerator, deep learning compiler, heterogeneous computing, model quantization, multi-level IR
This is an Open Access article distributed under the term of Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition +
Change Prohibition (http://www.kogl.or.kr/info/licenseTypeEn.do).
1225-6463/$ © 2024 ETRI
still rely on DL frameworks, creating a gap between these DL frameworks and AI accelerators can hinder the development and deployment of neural networks in fields with various constraints.

Traditional DL frameworks provide abstract functionalities but often lack the detailed optimizations required by various AI accelerators. To solve this problem, DL compilers have been developed [4]. These compilers take DL models developed within DL frameworks as inputs, optimize them, and translate them for specific hardware to produce efficient executable code. The optimization process includes graph optimization, memory management, parallel processing, quantization, and execution tuning. DL models are efficiently optimized to run on a diverse range of hardware, from datacenter servers to mobile devices and embedded IoT sensors. Leading DL compilers such as tensor virtual machine (TVM) [5], Glow [6], and XLA [7] offer specialized features and techniques for optimization and deployment across a wide range of applications. However, these existing solutions often fail to optimize the performance of newer AI accelerators such as neural processing units (NPUs) [8–12] and processing-in-memory (PIM) [13, 14] devices.

The rapid advancement of AI accelerators such as NPUs and PIM devices presents significant challenges. Traditional compilers, primarily designed for CPUs and GPUs, struggle to maximize the performance and leverage the unique features of these new accelerators. This gap necessitates a compiler framework that can efficiently optimize DL models for these diverse AI accelerators.

To overcome these challenges, this paper proposes the NEST compiler (NEST-C) [15], an advanced DL compiler framework designed to simplify deployment and enhance the efficient execution of DL models. As an open-source project, NEST-C generates optimized codes for various AI accelerators, including NPUs and PIM devices. Furthermore, NEST-C offers tuning features and tools tailored to the characteristics of each AI accelerator. The main contributions of NEST-C are as follows:

• To enable DL models for heterogeneous AI accelerators: NEST-C facilitates the use of DL models on a variety of AI accelerators, such as NPUs and PIMs, by supporting necessary adaptations for quantization and optimization. This ensures that models can efficiently operate across heterogeneous hardware platforms, broadening their applicability and performance.
• To optimize the DL model execution for multiple AI accelerators through graph partitioning: NEST-C utilizes graph partitioning techniques to efficiently distribute the workload of DL models. This is achieved by developing algorithms that enable dynamic task partitioning in mixed hardware systems, which include multiple AI accelerators, thereby maximizing computational efficiency.
• To enhance AI accelerator portability: By providing integration interfaces at each stage of the intermediate representation (IR) in the compilation process, NEST-C makes it easier to connect various hardware with compilers. This approach ensures that the compiler can be utilized more conveniently, facilitating seamless operation across different AI accelerators and enhancing the system’s overall flexibility and efficiency.

2 | BACKGROUND AND RELATED WORK

2.1 | Common structures of DL compilers

The general design architecture of a DL compiler is divided into front-end and back-end architectures to facilitate optimization and support for various hardware platforms [4]. Additionally, there is an abstract representation known as IR. IR is a crucial concept in compiler design that acts as a bridge between high-level programming languages and machine code. This simplifies the complex process of translating human-readable code into instructions that can be executed by hardware. LLVM IR [16] is among the most commonly used IRs for targeting CPUs and GPUs, owing to its versatility and extensive support within the LLVM compiler. However, because it is a low-level representation, LLVM IR is not inherently suited for optimization at the level of DL operators, which often requires more abstract and higher level transformations to achieve efficient execution on various hardware platforms. Consequently, DL compilers require higher level IRs that are better suited for expressing and optimizing DL models. To accommodate the complexity and diversity of DL models and hardware platforms, DL compilers may employ multiple levels of IR, each serving a different stage of the compilation process. The front end handles DL models from frameworks such as TensorFlow and PyTorch, performing optimizations such as operation fusion, the elimination of redundant operations, and memory access optimization to improve efficiency. It then abstracts the model’s structure and operations into standardized high-level IRs. In the front end, a high-level IR is used to optimize the relationships between operators and tensors in a hardware-independent manner. For example, TVM’s
22337326, 2024, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.4218/etrij.2024-0139, Wiley Online Library on [27/05/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
relay IR [17] uses tensors and placeholders for data representation, thereby providing scalability for various operators. The back end uses low-level IRs, which contain more hardware specifications than the higher layers. Based on this information, it optimizes the code to account for hardware-specific characteristics, calling hardware-specific libraries where necessary to maximize execution efficiency. In addition, the back end translates the code into code that can run on actual hardware (e.g., CPUs, GPUs, NPUs, and PIMs). In the back end, a low-level IR is utilized for hardware-dependent optimization and code generation for the target hardware. For instance, TVM advances the approach based on Halide, and Glow optimizes tensor processing with command-based IR.

2.2 | Existing DL compilers

Google’s XLA enhances TensorFlow by providing hardware-agnostic and hardware-specific optimization through its compiler framework. It improves execution speeds and memory usage in DL models using techniques such as operator fusion and buffer analysis. XLA offers both just-in-time (JIT) and ahead-of-time (AOT) compilations, supporting diverse types of hardware such as CPUs and NVIDIA GPUs [18]. TVM, an open-source project, employs a multi-layered optimization strategy that separates computation from scheduling. It optimizes code across CPUs, GPUs, FPGAs, and ASICs using its unique IR systems RelayIR and tensor IR. Meta’s Glow focuses on optimizing deep neural network models across various platforms using a two-stage IR process [19], prioritizing efficient execution and model portability. It supports CPUs and GPUs by optimizing memory usage and execution speed.

Accelerated Linear Algebra (XLA) is developed by Google for TensorFlow. It uses a compiler framework that performs hardware-agnostic high-level and hardware-specific low-level optimizations. The high-level optimizer (HLO) IR is used for graph-level optimizations, and LLVM is used for code generation. XLA optimizes TensorFlow graphs using techniques such as operator fusion, common subexpression elimination, and buffer analysis, resulting in improved execution speeds and memory usage for deep neural networks (DNNs).

TVM is an open-source platform that introduces a multi-layered optimization approach. This approach separates computation from scheduling and is inspired by Halide. TVM’s optimization process involves graph optimization, operator-level optimization, and automatic tuning, facilitated by a machine learning-based system to identify the optimal schedule among billions of possibilities. This approach enables efficient code generation for various targets including CPUs, GPUs, FPGAs, and ASICs. TVM’s IR system, consisting of Relay and TIR, abstracts model operations and structures at different levels. This enables precise optimizations and code generation tailored to the specific characteristics of the hardware.

Graph Lowering (Glow), developed by Facebook, aims to optimize DNN models for efficient execution across different hardware platforms using a two-stage IR process. The optimization involves high-level graph optimizations followed by hardware-specific optimizations through node lowering, focusing on minimizing memory consumption and maximizing execution speed. Glow’s IR enables high-level graph optimizations and decomposes high-level operators into lower level linear algebra nodes, enabling efficient execution on CPUs, GPUs, and ASICs. This process prioritizes model portability while optimizing performance and uses an “in-memory form” lower level IR for hardware-dependent optimizations and memory latency hiding.

All existing compilers were primarily designed for CPUs and GPUs, making it challenging to adapt them for the newly emerging NPUs and PIMs. The arrival of complex and varied AI accelerators such as NPUs and PIMs has significantly increased the difficulty of optimizing edge devices that incorporate these accelerators. Moreover, they do not account for the complexities involved in the simultaneous optimization and leveraging of the parallel capabilities of various AI accelerators. Distinct differences exist between the current compilers (TVM, Glow, and XLA) and NEST-C, as detailed in Table 1. NEST-C includes features in each optimization category that are not supported by the existing compilers and, notably, provides broader support for a wider range of NPUs.

3 | NEST-C

3.1 | NEST-C overview

This study devised NEST-C, a DL compiler designed to support various edge-specific AI accelerators, and provides optimizations including graph partitioning, quantization, and execution tuning. NEST-C supports traditional processing units such as CPUs and GPUs while generating optimized code for AI accelerators such
Optimization feature | TVM | Glow | XLA | NEST-C
Quantization | Calibration-based Int4, Int8, and Fp16 | Profiling-based Int8, Int16, and Fp16 | Profiling-based Int8, Int16, and Fp16 | Profiling-based Int8, Int16, and Fp16; layer-wise mixed precision
Graph partitioning | Memory-based static partitioning | Static partitioning | Static and dynamic partitioning | Profiling-based dynamic partitioning
NPU back-end support | VTA (open architecture) | Habana (closed architecture) | Google TPU | EVTA, AimFuture NMP, OpenEdge Enlight, and SK Hynix GDDR6-AiM
Auto-tune of tile size | Execution time profiling, uniform tile size, and static tile scheduling | Execution time profiling | Auto-tuning and hardware-specific optimization | Hardware utilization, profiling based, non-uniform tile size, and dynamic tile scheduling
Layer fusing and execution | Static layer fusing and layer-by-layer execution | Static layer fusing and layer-by-layer execution | Operator fusion for optimized execution | Dynamic layer fusing and dynamic layer execution
NN deployment on heterogeneous accelerators on a device: O
NN deployment on multiple devices: O O
accelerators. NEST-C employs a tensor IR to perform hardware-dependent optimizations for each operator separately. This approach allows the fine-tuning of the performance of DL models on AI accelerators by optimizing memory usage, computational efficiency, and execution flow based on the unique characteristics and capabilities of each target device. The code generator creates executable code by utilizing a variety of back-end code generators, including the “C code generator” (CCodeGen), “LLVM IR generator,” “relay IR generator,” and “NPU code generator.” Each code generator produces code optimized for specific processing units, which is eventually converted into executable files using compilers such as GCC, LLVM, TVM relay, and those provided by each back end.

Detailed explanations of each configuration selected by Algorithm 1 are provided below.
value distributions and toggling between symmetric (Z = 0) and asymmetric (Z = -128) quantization as follows:

$$
Q_{i8} = \mathrm{ROUND}\!\left(\frac{V_{f32}}{S}\right) + Z, \qquad S = \frac{\mathrm{MAX}(\mathrm{ABS}(V_{f32}))}{2^{n} - 1},
$$
$$
Z = \begin{cases} -128, & \text{if } V_{\min} \ge 0, \\ 0, & \text{otherwise,} \end{cases} \qquad V_{f32} = (Q_{i8} - Z) \times S. \tag{3}
$$

The symmetric scheme with a power-of-two scale simplifies the hardware design by using bit-shift operations instead of multiplications and is defined by

$$
S = 2^{\left\lceil \log_2 \frac{\mathrm{MAX}(\mathrm{ABS}(V_{f32}))}{2^{(n-1)} - 1} \right\rceil}. \tag{4}
$$

Clipping: To mitigate accuracy loss without retraining, the Glow compiler employs clipping to minimize the Kullback–Leibler divergence between the floating-point and quantized distributions, addressing the impact of outliers in weight and activation distributions.
Granularity: The choice between tensor-wise and channel-wise quantization granularity balances accuracy and latency, with fine granularity increasing computational demand, particularly in convolutions with diverse weight values.
Mixed precision: NEST-C supports mixed-precision quantization by maintaining the first and last layers at their original precision (FP32) for execution. The first and last layers are known to be the most sensitive to quantization [20] and are prioritized for quick mixed-precision decisions. For additional mixed-precision applications, developers must determine the layers that should be quantized.

3.2.2 | Hardware-independent optimizations

Hardware-agnostic optimization focuses on the structural and algorithmic characteristics of DL models. This approach performs optimizations that can generally enhance performance across all types of hardware. In addition to the PTQ techniques previously mentioned (Section 3.2.1), the key hardware-agnostic optimization techniques applied in NEST-C include layer fusion and operator scheduling.
Layer fusion: Fusion, such as combining convolution with batch normalization and ReLU activation, serves to increase computational efficiency and minimize memory usage. The convolution operation, denoted by $y = w * x + b$, when fused with batch normalization, transforms into $y' = \gamma\,[(w * x + b - \mu)/\sqrt{\sigma^{2} + \epsilon}] + \beta$, streamlining the computation by merging the standardization directly into the convolution process. Similarly, fusing convolution with ReLU activation simplifies the operation to $y' = \max(0, w * x + b)$ by applying nonlinearity immediately after convolution, thereby reducing computational steps and memory overhead. Layer fusion, implemented as an optimization pass at the compiler level, streamlines neural network operations by reducing computational steps and memory usage, enhancing efficiency and speed, especially in memory-restricted environments.
Operator scheduling: From the perspective of IR scheduling in neural network computation, the scheduling of operators is crucial for optimizing performance. Transformations, such as in-place buffer modifications for element-wise arithmetic, are key examples of such optimizations. In this context, instructions manipulate global variables or locally allocated buffers, with each operand annotated using the qualifiers in, out, or inout. These qualifiers indicate whether a buffer is read from (in), written to (out), or both (inout), indicating to the optimizer when optimizations such as copy elimination or buffer sharing are possible.

3.2.3 | Graph partitioner (PartitionTuner)

Typically, edge AI accelerators support a limited range of DL operations, necessitating the partitioned execution of DL models. Frameworks such as XLA and TVM provide code generation for accelerators but are limited to supporting only a single accelerator. To address this issue, NEST-C integrates PartitionTuner [21], which distributes the operations of a DL model across multiple accelerators and synchronizes the processing of their operations to enhance the overall performance. Figure 2 illustrates the architecture of this approach.

FIGURE 2 Workflow of the PartitionTuner in NEST-C.

Initially, the Branch Extractor analyzes the structure of the input graph to identify branches that require sequential execution. Subsequently, the Graph Partitioner segments the graph into groups of operations for each branch. The Profiler then generates a machine
code for each back end identified by the Back-End Finder and profiles the execution times for each group of operations. Based on these profiling results, a user-modifiable Partition Policy file is automatically generated. The Partition Scheduler creates partitions based on the Partition Policy using the back-end information assigned to the operations. It then develops a Scheduling Plan that specifies the execution sequence among the partitions. Subsequently, based on the Scheduling Plan, the Partition Tuner generates Scheduling Code for each back end. This approach identifies the fastest partition for each accelerator, ensuring the optimal utilization of various accelerators within heterogeneous computing systems. Additionally, the overall execution speed can be significantly improved by employing independently operable accelerators in parallel.

3.3 | NEST-C back end

3.3.1 | Tensor IR and hardware-dependent optimization

During graph partitioning, the hardware on which the partitioned graph will execute is determined, and based on this specific hardware, the IR is lowered to the tensor-IR level. At this level, the memory size and location of the inputs and outputs used by each operator are determined. This phase aims to find the optimal execution code that minimizes memory usage while improving execution performance. NEST-C can dynamically allocate the main memory with the help of the operating system depending on the target hardware. However, this method introduces overhead for allocation and deallocation. Furthermore, if the DL tensor data are distributed across various memory areas, it could degrade the performance of the NPU’s direct memory access.

Therefore, NEST-C calculates the total memory size required to execute the operations of a specific partition from beginning to end. Subsequently, it allocates that amount of memory (preferably a contiguous block) just once. It then devises a strategy to statically reuse memory within that allocated space without further assistance from the operating system. This approach reduces memory fragmentation and allocation overhead, ensuring efficient memory management and optimized performance for the DL computations on the targeted hardware.

NPUs have small internal buffers so that they can deliver data faster than DRAM; however, they cannot store all inputs and outputs simultaneously. Instead, the inputs from memory are partially stored, with outputs accumulated as the computations proceed. This necessitates diverse scheduling strategies such as input stationary, weight stationary, and output stationary [22]. “Tiling” involves storing and processing inputs and outputs in the buffer sequentially; an NPU requires all inputs for a tile to be loaded and space for outputs to be available before processing begins.

The period an NPU waits to begin operation is called the “memory latency,” and strategies such as adjusting the size of tiles and employing double buffering are used to reduce this time, a practice known as “memory latency hiding.” NEST-C provides three main optimization approaches to maximize memory latency hiding and determine the optimal tile size and scheduling.
Empirical discovery through device profiling: This involves experimenting on actual devices to find optimal settings. However, an exhaustive search of all the possible configurations can be time-consuming or inefficient.
Auto-tuning using machine learning: To mitigate the challenges of exhaustive search, NEST-C employs auto-tuning techniques that leverage machine learning to streamline the optimization process, which significantly reduces the search space.
Utilization of performance modeling or performance accurate simulators: This method involves using performance models or simulations to search all cases. If a model or simulator is implemented precisely, the optimal execution method can be identified quickly.

3.3.2 | Code generator

The code generator creates code tailored to the target hardware based on the optimized tensor IR. Depending on the target hardware, it may need to generate binary machine code or code that conforms to the device driver or inference library interface of the target hardware. Alternatively, the tensor IR could be lowered to an LLVM IR via an LLVM code generator.

It is necessary to implement a unique code generator for each target hardware, and NEST-C provides references that can be consulted when developing code generators for new hardware. Additionally, NEST-C offers functionality that not only generates code for several popular models (e.g., ImageNet classification) but also automatically creates code for input preprocessing and output presentation. The code generator further includes various back-end generators, such as C, NPU, and EVTA code generators. The following section details the integration of the code generator with diverse AI accelerators across multi-level IR.
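As a numerical illustration of the quantization formulas in Eq. (3) and Eq. (4) of Section 3.2.1, the following minimal Python sketch quantizes a small non-negative tensor to 8 bits and computes a power-of-two scale. It is not taken from the NEST-C code base: the function names are hypothetical, the zero point for non-negative tensors is assumed to be -128 so that the result fits a signed 8-bit range, and only the non-negative branch of Eq. (3) is covered.

```python
import math


def quantize_eq3(values, n_bits=8):
    """Quantize non-negative floats following Eq. (3) (asymmetric branch only)."""
    v_min = min(values)
    v_max_abs = max(abs(v) for v in values)
    if v_min < 0:
        raise ValueError("this sketch covers only tensors with V_min >= 0")
    scale = v_max_abs / (2 ** n_bits - 1)       # S in Eq. (3)
    zero_point = -(2 ** (n_bits - 1))           # assumed Z = -128 for n_bits = 8
    # Round half up; the exact rounding mode of ROUND is an assumption.
    q = [math.floor(v / scale + 0.5) + zero_point for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Inverse mapping from Eq. (3): V = (Q - Z) * S."""
    return [(x - zero_point) * scale for x in q]


def pow2_scale(values, n_bits=8):
    """Power-of-two scale from Eq. (4), usable with bit shifts in hardware."""
    v_max_abs = max(abs(v) for v in values)
    return 2.0 ** math.ceil(math.log2(v_max_abs / (2 ** (n_bits - 1) - 1)))


activations = [0.0, 0.5, 1.0, 2.55]             # hypothetical post-ReLU values
q, s, z = quantize_eq3(activations)
restored = dequantize(q, s, z)
shift_scale = pow2_scale([-3.0, 1.5])           # hypothetical signed weights
```

In this sketch the reconstruction error of each element is bounded by half of the scale, which is the usual behavior of round-to-nearest uniform quantization.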
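The single-allocation strategy of Section 3.3.1 (compute the total memory a partition needs, allocate it once, and reuse regions statically) can be sketched as an offline offset planner. The code below is a generic greedy first-fit planner written for illustration; it is an assumption about how such a plan could be computed, not the algorithm used by NEST-C, and the `TensorLife` record, the `plan_arena` function, and the 64-byte alignment are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorLife:
    name: str
    size: int   # bytes
    first: int  # index of the operator that produces the tensor
    last: int   # index of the last operator that reads the tensor


def plan_arena(tensors, align=64):
    """Assign every tensor a fixed offset inside one contiguous arena.

    Tensors whose lifetimes do not overlap may share the same bytes.
    Returns (offsets by tensor name, total arena size in bytes).
    """
    placed = []   # (start offset, end offset, tensor)
    offsets = {}
    # Placing large tensors first tends to reduce fragmentation.
    for t in sorted(tensors, key=lambda t: -t.size):
        size = -(-t.size // align) * align  # round up to the alignment
        # Regions held by tensors that are alive at the same time as t.
        busy = sorted((start, end) for start, end, other in placed
                      if not (other.last < t.first or t.last < other.first))
        offset = 0
        for start, end in busy:
            if offset + size <= start:      # the gap before this region fits
                break
            offset = max(offset, end)
        placed.append((offset, offset + size, t))
        offsets[t.name] = offset
    arena_size = max((end for _, end, _ in placed), default=0)
    return offsets, arena_size


# Hypothetical four-operator chain: a -> b -> c -> d.
chain = [
    TensorLife("a", 1024, 0, 1),
    TensorLife("b", 2048, 1, 2),
    TensorLife("c", 1024, 2, 3),
    TensorLife("d", 512, 3, 3),
]
offsets, arena_size = plan_arena(chain)
```

For this hypothetical chain the planner needs 3072 bytes instead of the 4608 bytes required if every tensor had its own allocation, because tensors `a` and `c` are never alive at the same time and therefore share one region.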
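The third approach in Section 3.3.1, searching all tile configurations with a performance model, can be illustrated with a deliberately simple analytical model. Everything numeric below (DRAM bandwidth, fixed DRAM latency, MAC rate, work per element, candidate tile sizes) is a hypothetical placeholder rather than a measured property of any NPU, and the model ignores partial tiles and output write-back. With double buffering, the load of the next tile is assumed to overlap fully with the computation of the current one.

```python
import math


def layer_latency(total_elems, tile_elems, n_buffers,
                  bytes_per_elem=1, dram_bw=4e9, dram_latency=2e-6,
                  macs_per_elem=64, mac_rate=1e11):
    """Estimated seconds to process one layer under a toy performance model."""
    n_tiles = math.ceil(total_elems / tile_elems)
    load = dram_latency + tile_elems * bytes_per_elem / dram_bw
    compute = tile_elems * macs_per_elem / mac_rate
    if n_buffers >= 2:
        # Double buffering: only the first load and the last compute are exposed.
        return load + (n_tiles - 1) * max(load, compute) + compute
    # Single buffer: every load is followed by its compute, nothing overlaps.
    return n_tiles * (load + compute)


def best_config(total_elems, buffer_bytes, bytes_per_elem=1,
                candidates=(256, 512, 1024, 2048, 4096, 8192)):
    """Exhaustively search tile size and buffer count that fit the on-chip buffer."""
    best = None
    for tile in candidates:
        for n_buf in (1, 2):
            if tile * bytes_per_elem * n_buf > buffer_bytes:
                continue  # configuration does not fit the on-chip buffer
            t = layer_latency(total_elems, tile, n_buf, bytes_per_elem)
            if best is None or t < best[0]:
                best = (t, tile, n_buf)
    return best


latency, tile, n_buf = best_config(total_elems=1 << 20, buffer_bytes=8192)
```

Under these placeholder constants the search prefers two 4096-element buffers over a single 8192-element buffer: the larger tile amortizes the fixed DRAM latency better, but it leaves no room to overlap loading with computation.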
represent either an entire model or a specific group of consecutive operators. The generated YAML file includes information regarding each partition and the names of the operators used in the IR graph. This information is utilized for code generation tailored to the partitioning or integration with specific back-end compilers.
ONNX IR Legalizer: This module performs normalization to resolve representation differences between the graph IR and the ONNX IR. This step is necessary for converting the graph IR into ONNX IR while ensuring consistency in the representation.
Sanity Checker with ONNXRuntime: To verify that the model converted to ONNX functions correctly, this module uses ONNXRuntime, an inference runtime that can execute on general-purpose CPUs, to validate the model’s results.
Partial ONNX Exporter: This module converts and stores the YAML meta file in the ONNX format. This step converts the metafile, after hardware-independent optimization, to the final ONNX format.

Using this structure, NEST-C partitions and normalizes the input model before generating the final ONNX format model. This model undergoes hardware-independent optimization and is integrated with the Enlight compiler, enabling hardware-accelerated inference. This approach also allows for the integration of a third-party private compiler.

4.3 | Tensor-IR implementation

The neuromorphic processor (NMP) is an embedded evaluation board developed by LG Electronics centered on a novel architecture designed to facilitate efficient DL operations. The foundational principle of the NMP’s design involves leveraging RISC-V instruction set architecture extensions to create specialized instructions for various CNN components, including convolutional layers, fully connected layers, pooling layers, and element-wise operations. The architecture of the NMP features a multicore NPU comprising several processors. Each processor houses multiple processing units, each equipped with a RISC-V core, a multiply-accumulate unit, and memory buffers to support the computational demands of neural network processing.
Quantization for NMP: NMP is a hardware accelerator that supports only integers in Q-format. Therefore, the NEST-C middle end for NMP must ensure that all DL models are quantized into Q-format. For Q-format quantization, the minimum and maximum values of all tensors holding floating-point values are calculated, and NEST-C determines the most suitable format, either INT8 or INT16, as supported by NMP. This process allows the precise representation of floating-point numbers within the fixed-point integer constraints of the NMP, thereby optimizing both the accuracy and efficiency of DL computations.
Back-end optimization for NMP: Because each processing unit in NMP possesses its own buffer and accelerator, operations must be well tiled and distributed, similar to EVTA. Whereas EVTA determines the optimal tile size and number of buffers by profiling the execution performance on real devices considering only output stationary scheduling, NMP considers input stationary, weight stationary, and output stationary scheduling. It uses a performance model to exhaustively search for and identify the best performance across all scenarios. This comprehensive approach enables NMP to optimize the resource allocation and processing efficiency for diverse computational patterns in DL operations.

4.4 | EVTA

EVTA is a custom NPU based on the VTA [23] from the Apache TVM project. It modifies VTA’s compute module to include an MOV instruction, reducing power usage and improving performance by eliminating DRAM data transfers between the output and input buffers. Moreover, EVTA supports diverse operations (INT8, FP16, FP32, and binary) and allows multiple NPUs to work together and share DRAM resources, as illustrated in Figure 4.

FIGURE 4 EVTA architecture and configuration.

EVTA’s IR is extended within the structure of NEST-C’s graph IR and tensor IR to meet the hardware characteristics and optimization requirements.
Graph-IR and middle-end optimization: At the middle end of NEST-C, efforts are made to minimize DRAM access and maximize hardware utilization by changing the data layout according to the DL operators and performing layer fusion optimizations. The compute module of EVTA can process ReLU operations in conjunction with matrix multiplication; hence, ReLU is fused with convolution and fully connected operators. Because EVTA can
only perform matrix multiplication during the middle-end stage, it transforms a 4-dimensional input into a 6-dimensional one, converting the last two dimensions into a matrix size that EVTA can process in one cycle. Additionally, EVTA’s graph IR was implemented to represent such a six-dimensional data layout. Following this, graph partitioning is performed according to the configuration of multi-EVTA, and it is then lowered to tensor IR.

Tensor IR and back-end optimization. In NEST-C back end, as the graph IR is lowered to tensor IR, the execution order of each operator and the DRAM storage locations for the operator’s input and output data are determined. For EVTA, the inputs must be tiled because the buffer size is limited. Double buffering is utilized to maximize memory latency hiding. The extent of latency hiding varies with the tile size and number of buffers, and NEST-C employs profiling and auto-tuning to determine the optimal tile size and number of buffers.

Codegen. NEST-C provides the EVTA execution library along with the device driver, featuring an interface in the format of Figure 5. NEST-C generates the code according to the execution library interface. For convolution operations, the arguments also include optimized tile sizes and the number of buffers. The generated code is designed to allow the device driver to parse the arguments at runtime and immediately execute the code.

NEST-C is tasked with handling a large number of tensor data and deep layers. The code generated by NEST-C computes each layer, and the results are temporarily stored in DRAM. To ensure that the EVTA developed using NEST-C is error free, it is necessary to validate the computational results with respect to the expected values. For this purpose, NEST-C supports a debugging mode that allows the temporary results stored in DRAM for each layer to be outputted to a file. The results of the operations may differ slightly, depending on the hardware and data type. NEST-C accommodates these differences by allowing the tolerance level to be set.

experiment to demonstrate the reduction in latency and improved computational efficiency when using both CPU and NPU resources with NEST-C. Next, to verify the portability of NEST-C, we tested it with two types of commercial AI accelerators: AimFuture NMP for the tensor-IR interface and OpenEdge Enlight for the graph-IR interface. These experiments encompassed a variety of DL models trained on ImageNet [24], including GoogLeNet [25], ResNeXt50 [26], ResNet50 [27], ResNet18 [27], MNIST [28], LeNet [29], MobileNetV2 [24], and SqueezeNet [30]. Considering the diversity in the types of operations supported by each AI accelerator, distinct DL models were chosen for each experiment.

5.1 | EVTA with PartitionTuner

The Xilinx ZCU102 platform was used to evaluate the performance of employing EVTA in NEST-C. The ZCU102 provides a quad-core ARM Cortex-A53 CPU and FPGA. Four EVTAs were ported to the FPGA on the Xilinx ZCU102 platform to run at 333 MHz. CCodeGen was used to generate the machine code for the CPU. The model was partitioned by CPU and NPU using PartitionTuner, which also performed the quantization for the EVTAs. Table 2 shows a significant reduction in latency when NPUs are used in conjunction with the CPU instead of using the CPU alone. However, increasing the number of NPUs does not significantly reduce latency. This is because quantization and data dimension transformation operations required to use NPUs are executed on the CPU, which is much slower than the NPUs. The parallel execution of these operations by two NPUs and the CPU limits fully parallel processing. Therefore, to improve the performance of multiple NPUs, it is important to partition and schedule the model.

TABLE 2 Latency (ms) of the DL models on multi-EVTA.
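The limited benefit of adding NPUs reported in Section 5.1 is consistent with an Amdahl-style bound: the CPU-side quantization and dimension transformation form a portion of the latency that does not shrink as NPUs are added. The following is a minimal sketch of that reasoning only. The function name `estimated_latency_ms` and all latency values are hypothetical assumptions chosen for illustration; they are not measurements from Table 2, and the model ignores scheduling and data-transfer overheads.

```python
def estimated_latency_ms(cpu_serial_ms: float, npu_parallel_ms: float, num_npus: int) -> float:
    """Amdahl-style estimate: only the NPU-side work scales with the NPU count."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    # Idealized assumption: NPU work divides evenly across all NPUs.
    return cpu_serial_ms + npu_parallel_ms / num_npus


# Hypothetical values for illustration only (not measured results).
CPU_SERIAL_MS = 40.0    # quantization + dimension transformation on the CPU
NPU_PARALLEL_MS = 20.0  # operator work that can be spread across NPUs

latencies = {
    n: estimated_latency_ms(CPU_SERIAL_MS, NPU_PARALLEL_MS, n) for n in (1, 2, 4)
}
for n, ms in latencies.items():
    print(f"{n} NPU(s): {ms:.1f} ms")
```

Under these assumed numbers, going from one to four NPUs removes only a quarter of the total latency, because the CPU-side portion stays fixed. This is why reducing or better scheduling the CPU-side operations matters more than adding accelerators.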
5.2 | Tensor-IR interface evaluation

To assess the performance of AI accelerators at the tensor-IR level in NEST-C, the NMP [22] board provided by AimFuture was utilized. Various DL models were evaluated, including MNIST, LeNet, ResNet18, ResNet50, SqueezeNet, and Inception. Performance evaluations for each DL model within NEST-C were conducted and compared with the performance data from XLA provided by NMP. The model accuracy provided by Google TensorFlow was used as baseline. Table 3 shows that most of the reference models exhibited results similar to the baseline accuracy, suggesting that NEST-C was effectively implemented at the tensor-IR level within NMP. Unlike NMP’s XLA, which is optimized directly by the hardware manufacturer, NEST-C automatically performs optimization at the compiler level through hardware profiling. Despite this, the latency performance of NEST-C was very close to that of NMP’s XLA, indicating its ability to maintain optimal performance even when new AI accelerators were integrated.

5.3 | Graph-IR interface evaluation

The proposed common ONNX-based interface was used to experimentally validate the successful linkage between the general AI compiler and the private NPU compiler. For a comparative evaluation, the results generated by integrating Enlight and NEST-C were compared with the performance results of ResNet50 and MobileNetV2 DL models on general hardware CPUs and GPUs supported by the existing NEST-C.

OpenEdge Enlight supports various layer types, including convolution layers with kernel sizes ranging from 1 × 1 to 7 × 7 and strides from 1 to 4. The depth-wise convolution layer supports a 3 × 3 kernel with strides ranging from 1 to 2. The supported activation functions include Bypass, ReLU, Leaky ReLU, Sigmoid, Tanh, and Mish. Additionally, pooling operations are supported with configurations of 4:1 (2 × 2) and Stride 2, which are available in max/average and global average pooling modes.

The experimental results, as shown in Figure 6A, indicate a decrease in accuracy when models are executed on OpenEdge Enlight. This decrease was due to the quantization process, which reduced the model precision from 32 to 8 bits. Despite this reduction, the models maintained acceptable performance.

In terms of the inference latency, as depicted in Figure 6B, the integration of NEST-C and OpenEdge Enlight achieved impressive inference times of 9 ms for MobileNetV2 and 107 ms for ResNet50. This performance is significantly faster than the inference times on an Intel i7 CPU, which are 59.9 ms for MobileNetV2 and 136.9 ms for ResNet50. It is also comparable to the results on an NVIDIA 2080ti GPU, which are 15.35 ms for MobileNetV2 and 14.8 ms for ResNet50.

5.4 | Discussion and limitations

The experimental results demonstrated the effectiveness and versatility of the NEST-C framework across various

FIGURE 6 Accuracy and latency of the models on the targets (Intel i7-8700 CPU, 2080ti GPU, and U280 FPGA-based NPU).
software stack for GDDR6-AiM, (IEEE Hot Chips 34 Symp., Cupertino, CA, USA), 2022, pp. 1–25.
14. Samsung, HBM-PIM: cutting-edge memory technology to accelerate next-generation AI, 2023. https://semiconductor.samsung.com/. Accessed: 2024-03-18.
15. ETRI, NEST-C. https://gitlab.com/ones-ai/nest-compiler. Accessed: 2024-03-22.
16. C. Lattner and V. Adve, LLVM: a compilation framework for lifelong program analysis & transformation, (Int. Symp. Code Gener. Optim., San Jose, CA, USA), 2004, pp. 75–86.
17. J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, Relay: a new IR for machine learning frameworks, (Proc. 2nd ACM SIGPLAN Int. Workshop Mach. Learn. Program. Lang., Association for Computing Machinery, Philadelphia, PA, USA), 2018, pp. 58–68.
18. J. Dean, Machine learning for systems and systems for machine learning, Presentation at 2017 Conf. Neural Inf. Process. Syst., Curran Associates, Long Beach, CA, USA, 2017.
19. Meta, Glow’s Graph IR optimization. https://github.com/pytorch/glow/blob/master/docs/Optimizations.md. Accessed: 2024-02-22.
20. J. Lee, M. Yu, Y. Kwon, and T. Kim, Quantune: Post-training quantization of convolutional neural networks using extreme gradient boosting for fast deployment, Future Gener. Comput. Syst. 132 (2022), 124–135.
21. Y. Misun, K. Yongin, L. Jemin, P. Jeman, P. Junmo, and K. Taeho, PartitionTuner: an operator scheduler for deep-learning compilers supporting multiple heterogeneous processing units, ETRI J. 45 (2023), no. 2, 318–328.
22. R. Sousa, M. Pereira, Y. Kwon, T. Kim, N. Jung, C. S. Kim, M. Frank, and G. Araujo, Tensor slicing and optimization for multicore NPUs, J. Parallel Distrib. Comput. 175 (2023), 66–79.
23. T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, A hardware-software blueprint for flexible deep learning specialization, IEEE Micro 39 (2019), no. 5, 8–16.
24. J. Deng, W. Dong, and R. Socher, ImageNet: a large-scale hierarchical image database, (IEEE Conf. Comput. Vision Pattern Recognit., Miami, FL, USA), 2009, pp. 248–255.
25. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, (Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Boston, MA, USA), 2015, pp. 1–9.
26. S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, (IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Honolulu, HI, USA), 2017, pp. 1492–1500.
27. K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, (IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Las Vegas, NV, USA), 2016, pp. 770–778.
28. L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Process. Mag. 29 (2012), no. 6, 141–142.
29. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998), no. 11, 2278–2324.
30. F. N. Iandola, S. Han, and M. W. Moskewicz, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1602.07360

AUTHOR BIOGRAPHIES

Jeman Park received his BS, MS, and PhD degrees in Electronics and Computer Engineering from Hanyang University, Republic of Korea, in 2004, 2006, and 2014, respectively. He is a senior researcher at the Electronics and Communications Research Institute, Daejeon, Republic of Korea. His research interests include computer networks, edge computing, and deep learning compilers.

Misun Yu received the MS degree from the Department of Computer Science and Engineering at Pohang University of Science and Technology, Republic of Korea. She is a principal researcher at the Electronics and Communications Research Institute Daejeon, Republic of Korea. Her main research interests include concurrent program analysis, software testing, deep learning, and embedded systems.

Jinse Kwon received MS and PhD degrees in Computer Science and Engineering from Chungnam National University, Daejeon, Republic of Korea in 2017 and 2024, respectively. He is currently a researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. His research interests include deep learning compilers and on-device computing.

Junmo Park received the BS degree in Computer Science from Kwangwoon University, Seoul, Republic of Korea in 2012 and the MS degree from the Graduate School of Convergence Science and Technology at Seoul National University, Republic of Korea, in 2020. He joined Samsung Electronics in Hwaseong, Republic of Korea, in 2012, where he has been involved in compiler optimization and development. Since 2020, he has been working as a Principal Software Engineer on mobile GPU compilers. His research interests include deep learning, compilers, embedded systems, HW/SW co-design, and optimization.
Jemin Lee received his BS and PhD degrees in Computer Science and Engineering from Chungnam National University, Daejeon, Republic of Korea, in 2011 and 2017, respectively. He is currently a senior researcher at the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Since 2023, he has also served as an assistant professor in the AI Department at the University of Science and Technology, Daejeon, Republic of Korea. He was a postdoctoral researcher at the Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea from 2017 to 2018. His research interests include energy-aware mobile computing and deep learning compilers.

Yongin Kwon received the BSc degree in Electrical and Electronic Engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, in 2008, and MS and PhD degrees in Electrical and Computer Engineering from Seoul National University, Republic of Korea, in 2010 and 2015, respectively. From 2015 to 2019, he worked for Samsung Electronics as a Staff Software Engineer. Since 2019, he has been with the Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea, where he is currently a senior researcher. His research interests include neural processing units, compilers, deep learning, and embedded systems.

How to cite this article: J. Park, M. Yu, J. Kwon, J. Park, J. Lee, and Y. Kwon, NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators, ETRI Journal 46 (2024), 851–864, DOI 10.4218/etrij.2024-0139