
Computing-In-Memory Aware Model Adaption For Edge Devices

Ming-Han Lin, and Tian-Sheuan Chang, Senior Member, IEEE

Abstract—Computing-in-Memory (CIM) macros have gained popularity for deep learning acceleration due to their highly parallel computation and low power consumption. However, limited macro size and ADC precision introduce throughput and accuracy bottlenecks. This paper proposes a two-stage CIM-aware model adaptation process. The first stage compresses the model and reallocates resources based on layer importance and macro size constraints, reducing model weight loading latency while improving resource utilization and maintaining accuracy. The second stage performs quantization-aware training, incorporating partial sum quantization and ADC precision to mitigate quantization errors in inference. The proposed approach enhances CIM array utilization to 90%, enables concurrent activation of up to 256 wordlines, and achieves up to 93% compression, all while preserving accuracy comparable to previous methods.

Keywords: Computing-in-memory, AI accelerator, Pruning framework, Network architecture search, Quantization-aware training

I. INTRODUCTION

The proliferation of complex deep learning models has spurred the development of specialized hardware accelerators for edge devices, where power and latency are critical constraints. Computing-in-Memory (CIM) has emerged as a highly promising architecture, offering massive parallelism and reduced data movement by performing computations directly within the memory array. However, the practical deployment of CIM is hindered by two fundamental and interconnected challenges rooted in its physical limitations.

First, Hardware Mapping and Throughput Bottlenecks arise from the constrained physical size of CIM macros. Modern deep neural networks are often too large to be stored entirely on-chip, necessitating that model weights be repeatedly loaded from off-chip memory. This frequent reloading incurs significant latency and energy overhead, negating many of CIM's intrinsic benefits.

Second, Computational Fidelity and Accuracy Degradation are direct consequences of the precision-limited analog-to-digital converters (ADCs) inherent to CIM design. When convolutions are segmented due to hardware size limits, multiple analog partial sums are generated. Each of these sums must be quantized by the ADC, causing quantization errors to accumulate and severely degrade model accuracy. A common workaround is to severely restrict the number of concurrently activated wordlines to match the ADC precision (e.g., activating only 16 wordlines for a 4-bit ADC). However, this drastically underutilizes the available parallelism of the CIM array and throttles performance.

To overcome these obstacles, researchers have proposed various model adaptation strategies. One line of work focuses on CIM-aware model compression and architecture search. For instance, E-UPQ [1] enhances model sparsity through pruning and mixed-precision quantization but suffers from low macro utilization. XPert [2] co-searches the neural architecture and peripheral circuits, but its rigid optimization constraints can limit flexibility. Similarly, CIMNet [3] uses a device-aware accuracy predictor for neural architecture search but overlooks the significant performance penalty caused by weight reloading.

Another line of work targets mitigating ADC quantization effects. These methods aim to increase the effective number of bits (ENOB) by mapping the multiply-accumulate (MAC) distribution to the ADC's input range. Approaches include optimizing quantization ranges based on MAC statistics [4], using input-conditioned subrange reduction techniques [5], or learning analog scaling factors [6], [7]. While effective, these methods often do not account for the large number of partial sums generated when many wordlines are activated in parallel, or are designed for smaller CIM macros [6].

The existing literature reveals a critical gap: a holistic approach that simultaneously optimizes the model architecture for dense mapping onto the CIM array while also making the model inherently robust to the partial sum quantization errors that arise from maximizing parallelism. To bridge this gap, this paper proposes a tailored model adaptation method that adjusts the model architecture and recalibrates weights to mitigate quantization errors. Our approach reallocates limited resources, such as bitlines per convolutional layer, to enhance efficiency while maintaining or improving accuracy. We implement a two-stage quantization-aware training process that quantizes both weights and partial sums, simulating CIM behavior and reducing the impact of quantization on model accuracy.

The rest of the paper is organized as follows: Section II details the proposed methods, Section III presents the experimental results, and Section IV concludes the paper.

This work was supported by the National Science and Technology Council, Taiwan, under Grants 111-2622-8-A49-018-SB, 110-2221-E-A49-148-MY3, 113-2221-E-A49-078-MY3, and 113-2640-E-A49-005. The authors are affiliated with the Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan (e-mail: [email protected], [email protected]). Cited as: M.-H. Lin and T. S. Chang, "Computing-in-memory aware model adaption for edge devices," to be published in IEEE Transactions on Circuits and Systems for Artificial Intelligence, 2026. Manuscript received XXXX XX, 2025; revised XXXX XX, XXXX.
II. PROPOSED CIM-AWARE MODEL ADAPTION

A. The Target Multibit CIM Architecture

Fig. 1. 4-bit CIM macro architecture

Fig. 1 illustrates the configuration of the CIM macro used in this paper. The workflow involves the following steps: a line buffer transfers 4-bit input data to a Digital-to-Analog Converter (DAC), converting it into an analog signal that enters the CIM weight array's wordlines. Each weight cell multiplies the input data, and the products are accumulated in each bitline. A multiplexer selects the processed signals, which are then converted into 5-bit digital partial sums by an ADC.

In terms of precision, each weight cell uses 4 bits, with parallel inputs converted to voltage by the DAC. The ADC then transforms the analog signal into a 5-bit digital format. This system requires only one ADC conversion for the multiply-accumulate operation, reducing the number of conversions by a factor of 16 compared to a bit-by-bit method, which helps minimize quantization errors, especially in the most significant bits (MSB).

The CIM array consists of 256 wordlines and 256 bitlines, along with 64 ADCs. Each weight cell stores 4 bits of data. The bitlines include positive (PBL) and negative (NBL) lines. The multiplexer selects different bitlines, and the ADCs operate in rotation to convert the analog signals into digital sums.

Fig. 2. The digital circuits that assist our CIM macro

In Fig. 2, 64 5-bit partial sums are accumulated using an adder tree and then multiplied by a scaling factor. Since the 64 ADCs are not used simultaneously, a multiplexer at each ADC output selects the appropriate ADC for accumulation. The final scaling factor combines both the weight scaling factor and the ADC step size, addressing the need to reverse the effects of scaling. This is necessary because the weights, initially in decimal form, are quantized into 4-bit integers, and the partial sums from the ADC also undergo scaling during conversion.

Fig. 3. Mapping convolution weights into a CIM macro

Fig. 3 illustrates the weight mapping for convolution. Due to the limited number of wordlines in the memory array, the multiply-accumulate operation cannot be completed in a single pass. Instead, the convolution kernel is divided into multiple parts based on the number of wordlines, processed in batches, and accumulated for the final result. For instance, with 256 wordlines and a 3x3 filter size, one bitline can handle up to 28 input channels, necessitating that any excess data be placed in the next bitline.

In the example, three filters are split into two parts, indicated by different colors, and stored in separate bitlines. The DAC inputs to the CIM macro include the orange section of the feature map, representing the first half of the input channels, which perform dot products with the corresponding darker sections of the filters. Consequently, only outputs from three bitlines are valid at this stage, while the remaining data will be processed subsequently.
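To make the mapping arithmetic above concrete, here is a minimal Python sketch of the bookkeeping for a 256-wordline macro: how many input channels of a k x k kernel fit on one bitline, and how many bitlines a layer's filters occupy once they are split into wordline-limited segments. The constants and helper names (channels_per_bitline, bitlines_for_layer) are illustrative assumptions, not code from the paper.

```python
import math

WORDLINES = 256   # rows of the CIM weight array
BITLINES = 256    # columns of 4-bit weight cells
ADC_BITS = 5      # each bitline readout is a 5-bit partial sum

def channels_per_bitline(kernel_size: int, wordlines: int = WORDLINES) -> int:
    """Input channels whose k x k weights fit on one bitline."""
    return wordlines // (kernel_size * kernel_size)

def bitlines_for_layer(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Bitlines needed when each filter is split into wordline-limited segments."""
    segments = math.ceil(in_channels / channels_per_bitline(kernel_size))
    return segments * out_channels

# Example from the text: 3x3 kernels and 256 wordlines -> 28 channels per bitline,
# so three 56-input-channel filters occupy two segments each (two partial sums).
print(channels_per_bitline(3))        # 28
print(bitlines_for_layer(56, 3, 3))   # 3 filters x 2 segments = 6 bitlines
```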
B. Overall Two-Stage Model Adaption Flow for CIM

Fig. 4. Model adaption flow for CIM

Fig. 4 outlines the overall model adaptation flow, consisting of two stages: CIM Aware Morphing to align models with the macro size, and ADC Aware Learned Scaling to scale weights based on the quantization precision of both the weights and the ADC.

CIM Aware Morphing adapts MorphNet [8] for CIM by adjusting channel numbers to fit macro size constraints such as the numbers of bitlines and wordlines, instead of the model size or FLOPs used in the original MorphNet. This iterative adjustment, typically converging in about three iterations, ensures that the model meets accuracy and resource requirements.

After roughly determining the model's shape and size, the next step involves quantizing the weights and partial sums according to the CIM weight cell's bit width, the ADC precision, and the ADC step size. ADC Aware Learned Scaling performs quantization-aware training in two steps:
• Quantization-aware training for the weights, including training the quantization step size to minimize weight quantization errors, and
• Quantization-aware training for the partial sums.

With this processing, the final model not only benefits from reduced redundancy through model morphing, which eliminates unnecessary filters and computations, but also addresses quantization errors through quantization-aware training, mitigating any significant accuracy drops caused by weight and partial sum quantization.

C. Stage 1: CIM Aware Morphing

CIM Aware Morphing, based on MorphNet [8], adapts the number of channels in convolutional layers to account for the constraints on wordline and bitline quantities in CIM macros by iteratively shrinking and expanding layers within a predefined architecture. In the shrinking phase, it prunes each layer based on sparsity, varying the pruning ratio across layers. During the expansion phase, layers are proportionally scaled up according to predefined constraints, focusing on reducing computational complexity or parameter count. This targeted approach can efficiently optimize network structures without extensive architectural redesign or architecture search.

Fig. 5. Model morphing flow

The details of the method are described below. In the "Shrinking Stage" of a deep learning network, the loss function for channel pruning consists of two parts: the cross-entropy loss L_CE(θ) and the regularization term λF(θ), as shown in Eq. 1, where λ is a hyper-parameter that controls the weight of the regularization term, and θ represents the model parameters.

Loss(θ) = L_CE(θ) + λF(θ)   (1)

To minimize redundancy, a regularization term related to the parameter count is designed as in MorphNet [8] to identify redundant parameters (see Eq. 2). The convolution filter dimensions are denoted as x and y. Filter importance is determined by the γ of the BN layer, with small γ values being zeroed out to prune unimportant filters. After pruning, the remaining input and output channels, denoted as A_L and B_L, correspond to the number of non-zero weights in the preceding and subsequent BN layers. The pruned parameter count is then calculated by multiplying A_L and B_L with x and y. Here, I_L and O_L represent the number of input and output channels of convolution layer L, while γ_{L−1} and γ_L denote the BN weights before and after the convolutional layer L, respectively.

F(layer L) = x × y × ( A_L · Σ_{i=1}^{O_L} |γ_{L,i}| + B_L · Σ_{j=1}^{I_L} |γ_{L−1,j}| )   (2)

To address CIM macro size constraints and identify redundancy, we use the parameter count as a regularization term when adjusting channels. This approach targets deeper layers, which typically contain more redundant parameters, helping to maintain model accuracy during compression.

For the "Expanding Phase", it is not possible to derive the expansion ratio for CIM macros directly using an equation, as is done with parameter expansion ratios, because of the array-based structure of CIM macros. Therefore, we first list the constraint equations for the model's expansion ratio in the CIM macro as follows:

⌈ 3 × kernel_size² / wordlines ⌉ × round(C_1 × R)   (3)

+ Σ_{i=1}^{n−1} [ ⌈ round(C_i × R) / channels_per_bl ⌉ × round(C_{i+1} × R) ] ≤ target_bl   (4)

channels_per_bl = ⌊ wordlines / kernel_size² ⌋   (5)

where R is the desired expansion ratio, n is the total number of convolutional layers, C_i is the number of output channels of the i-th convolutional layer, and channels_per_bl represents the maximum number of input channels that a single bitline can accommodate.

Since solving the above inequality is very complex, we use an exhaustive search here. By incrementing the ratio from 1 by 0.001 until the condition is no longer satisfied, we can find the desired expansion ratio. Additionally, only one exhaustive search is needed per morphing process, making the search very efficient. Note that the expansion ratio is applied proportionally across all layers, not as a separate ratio for each layer. This makes the optimization a simple one-dimensional search for a single scalar value.
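As a reader aid, the following Python sketch implements the one-dimensional search just described: it evaluates the bitline cost of Eqs. 3-5 for a candidate expansion ratio R and increments R by 0.001 until the target bitline budget would be exceeded. The layer list, the assumed 3-channel input of the first layer, and the function names are illustrative assumptions rather than the authors' code.

```python
import math

def bitline_cost(channels, R, kernel_size=3, wordlines=256):
    """Bitlines consumed when every channel count in `channels` is scaled by R.

    channels[i] is the output-channel count of convolution layer i; the first
    layer is assumed to take a 3-channel (RGB) input, as in Eq. 3.
    """
    ch_per_bl = wordlines // (kernel_size ** 2)                                   # Eq. 5
    cost = math.ceil(3 * kernel_size ** 2 / wordlines) * round(channels[0] * R)   # Eq. 3
    for c_in, c_out in zip(channels[:-1], channels[1:]):                          # Eq. 4
        cost += math.ceil(round(c_in * R) / ch_per_bl) * round(c_out * R)
    return cost

def find_expansion_ratio(channels, target_bl, step=0.001):
    """Increase R from 1.0 in 0.001 steps while the bitline budget still holds."""
    R = 1.0
    while bitline_cost(channels, R + step) <= target_bl:
        R += step
    return R

# Usage on a toy pruned backbone (hypothetical channel counts).
pruned_channels = [24, 48, 96, 96]
print(find_expansion_ratio(pruned_channels, target_bl=1024))
```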
D. Stage 2: ADC Aware Learned Scaling

Based on the above model adjustment, the next steps involve two rounds of quantization-aware training, as shown in Fig. 6. First, we combine convolutional and BN weights and quantize them to 4 bits to fit within a 4-bit weight macro. Second, partial sum quantization is applied to obtain the final quantized model.

Fig. 6. Quantization types for models mapped to the CIM macro

A convolution layer undergoes three types of quantization:
1) Weight Quantization: BN weights and convolutional weights from the morphed model are combined and quantized to 4 bits according to the precision of the weight cells in the CIM macro.
2) Partial Sum Quantization: The partial sums are quantized to 5 bits based on the precision of the given ADC.
3) Activation Quantization: This is included in the original seed model and will be quantized to 4 bits based on the DAC precision.

Fig. 7. Forwarding flow of Phase-1 training

1) Phase-1: Weight Quantization Training: Fig. 7 illustrates the Phase-1 weight quantization process for the model. During the forward computation of the model training, we reduce the number of parameters by combining the BN parameters with the convolutional kernel weights. These combined weights are then scaled by dividing them by the corresponding weight quantization step size, followed by clipping and rounding based on the weight bit-width. After performing the convolution with quantized activations, the results are scaled back by multiplying with the scaling factors.

In the above process, the step size of weight quantization is learned by the LSQ method [9]. The weight quantization equation is presented in Eq. 6. Here, W represents the weight, S_W is the weight quantization step size, and −Q_N and Q_P represent the minimum and maximum clipping values, respectively. These values are related to the number of bits being quantized; for instance, if quantizing to n bits, then Q_N = Q_P = 2^(n−1) − 1. This process allows the quantization error to be reflected in the floating-point representation.

output = [ round( clip( W / S_W, −Q_N, Q_P ) ) ] * Input × S_W   (6)

For our target macro, to produce 4-bit weights, the weights are first divided by S_W for scaling (where S_W is the weight quantization step size, typically less than 1). Then, based on the maximum and minimum values of the stored weight, the weights are clipped and rounded to obtain 4-bit weights that can be stored in the CIM macro. After performing convolution in the CIM macro with the 4-bit quantized weights, the output is multiplied by S_W to scale it back down.

During the Phase-1 training, we optimize the BN and convolution weights, along with the quantization step size S_W. The goal is to complete BN weight folding and quantize the weights, as detailed in Fig. 8.

In the backward pass, gradient computation bypasses the scaling and the non-differentiable rounding to maintain stability. The straight-through estimator (STE) is applied to the skipped rounding: gradients exceeding the clipping range are set to zero, while those within the range pass through unchanged. Additionally, since amplified weights and output gradients are used to compute the input gradients, these input gradients are inversely scaled down according to the weight amplification.

Fig. 8. Forward and backward data flow of weight quantization
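The forward and backward behavior just described, Eq. 6 with an LSQ-style learned step size and a straight-through estimator, can be sketched as a custom PyTorch autograd function. This is a minimal illustration under my own naming; it assumes BN has already been folded and omits LSQ's gradient-scale factor, so it should not be read as the authors' implementation.

```python
import torch

class WeightFakeQuant(torch.autograd.Function):
    """Fake-quantize folded weights to n bits with a learned step size (Eq. 6).

    Forward:  w_q = round(clip(w / s, -Q_N, Q_P)) * s
    Backward: straight-through estimator; gradients outside the clipping range
    are zeroed, and the step size s receives an LSQ-style gradient
    (the 1/sqrt(N*Q_P) gradient scale of LSQ [9] is omitted for brevity).
    """

    @staticmethod
    def forward(ctx, w, s, n_bits):
        q = 2 ** (n_bits - 1) - 1            # symmetric range, e.g. +/-7 for 4 bits
        ctx.save_for_backward(w, s)
        ctx.q = q
        return torch.round(torch.clamp(w / s, -q, q)) * s

    @staticmethod
    def backward(ctx, grad_out):
        w, s = ctx.saved_tensors
        q = ctx.q
        w_scaled = w / s
        inside = (w_scaled >= -q) & (w_scaled <= q)
        grad_w = grad_out * inside                       # STE for the weights
        # Step-size gradient: quantization residual inside the range,
        # clipping level outside it.
        grad_s_elem = torch.where(inside,
                                  torch.round(w_scaled) - w_scaled,
                                  torch.clamp(w_scaled, -q, q))
        grad_s = (grad_out * grad_s_elem).sum()
        return grad_w, grad_s, None

# Usage: fold BN into the convolution weights first, then fake-quantize them
# before running the convolution with quantized activations.
w = torch.randn(64, 28, 3, 3, requires_grad=True)
step = torch.tensor(0.05, requires_grad=True)
w_q = WeightFakeQuant.apply(w, step, 4)
```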
Fig. 9. Partial Sum Formation

2) Phase-2: Partial Sum Quantization Training: Due to the limited wordlines, larger convolutions must be processed in segments, leading to accumulated ADC quantization errors with each partial sum. To mitigate this, we incorporate partial sum quantization during the Phase-2 training to simulate the ADC behavior, which helps the model adapt to the quantization process. For example, as shown in Fig. 9, with 256 wordlines, a 3x3 kernel can accommodate up to 28 input channels per bitline, requiring additional channels to be assigned to another bitline. Therefore, for a feature map and filter with 56 input channels, we divide them into two groups, denoted in blue and purple in the figure. The blue feature map convolves with the blue filters, while the purple feature map convolves with the purple filters, resulting in two partial sums that can be added point by point to obtain the final result.

Fig. 10. Forwarding flow of Phase-2 training

Fig. 10 illustrates the forwarding flow of the Phase-2 training. Compared to the Phase-1, the Phase-2 includes additional steps for the segmented convolution, the quantization of partial sums, and the summation of partial sums. The model output from the Phase-1 training serves as the baseline model for the Phase-2 training.

Since the Phase-2 training involves the quantization of partial sums, even minor variations in S_W can directly affect the size of the 4-bit quantized weights if S_W is not fixed. This, in turn, can cause significant fluctuations in the partial sums, hindering model convergence. Therefore, in the Phase-2 training, S_W is fixed, and the BN and convolution weights are trained to adapt to the partial sum quantization.

By slightly modifying Eq. 6, we obtain the partial sum quantization formula, as shown in Eq. 7. This formula primarily incorporates the ADC step size and sets the maximum and minimum clipping values according to the ADC precision, represented as −Q_N_ADC and Q_P_ADC.

output = round( clip( Q_w · Input / S_ADC, −Q_N_ADC, Q_P_ADC ) ) · S_W · S_ADC   (7)

Q_w = [ round( clip( W / S_W, −Q_N, Q_P ) ) ]   (8)

During the Phase-2 training process, only the BN and convolution weights are trained. The main goal is to adapt the weights to the quantization of partial sums. The detailed forward and backward methods are shown in Fig. 11.

Compared to the Phase-1 training, the Phase-2 includes scaling the partial sums according to the ADC step size, followed by rounding and summing. Finally, the scaling effect of the ADC step size is inversely scaled back at the output. In the backward pass, the gradient computation similarly skips all scaling and non-differentiable rounding operations to ensure that the gradients do not experience sudden scaling up or down, thus maintaining stability.

Fig. 11. Forward and backward data flow of partial sum quantization

Finally, the trained 4-bit weights can be directly used in the CIM macro for convolution operations with 4-bit inputs. After each convolution, the output only needs to be scaled by the product of the weight step size S_W and the ADC step size S_ADC. For further simplification, this product can be approximated as a power of two, allowing the output to be adjusted with a simple digital shift operation.
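Below is a minimal sketch of the Phase-2 forward path of Eq. 7 for one output position of a segmented convolution: the input channels are processed in wordline-limited groups, each group's partial sum is quantized to the ADC range, and the accumulated result is rescaled by S_W · S_ADC. The symmetric 5-bit ADC range and the function name are my assumptions for illustration.

```python
import torch

def phase2_partial_sum(x_int, w_int, s_w, s_adc, ch_per_group=28, adc_bits=5):
    """Eq. 7 for one output position: x_int is a (C_in,) vector of 4-bit activation
    codes and w_int a (C_out, C_in) matrix of 4-bit weight codes Q_w (Eq. 8).
    Each group of ch_per_group channels is one CIM pass whose analog sum is
    quantized by the ADC before accumulation (symmetric range assumed)."""
    q = 2 ** (adc_bits - 1) - 1                                   # +/-15 for 5 bits
    acc = torch.zeros(w_int.shape[0])
    for start in range(0, x_int.numel(), ch_per_group):
        grp = slice(start, start + ch_per_group)
        partial = w_int[:, grp].float() @ x_int[grp].float()      # one macro pass
        acc += torch.round(torch.clamp(partial / s_adc, -q, q))   # ADC quantization
    return acc * s_w * s_adc                                      # undo both scalings

# Toy usage: 56 input channels split into two groups of 28 -> two partial sums per filter.
x = torch.randint(-8, 8, (56,))
w = torch.randint(-8, 8, (8, 56))
y = phase2_partial_sum(x, w, s_w=0.05, s_adc=16.0)
```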
III. EXPERIMENTAL RESULTS

A. Experimental Setup

The experimental settings for our model training are shown below. We adopt the ADAM optimizer for all training. The seed models used in model morphing are trained with a learning rate of 0.01 over 2000 epochs. The CIM aware morphing phase uses a learning rate of 0.05 over 100 epochs for the shrinking stage and a learning rate of 0.01 over 100 epochs for the following fine-tuning stage. The ADC aware learned scaling adopts a learning rate of 0.001 over 100 epochs for Phase-1 and a learning rate of 0.01 over 300 epochs for Phase-2.

B. Analysis of Parameter Selection for the Model Morphing

The CIM aware model morphing has shown how to morph the model under the macro constraints. However, how to select the compression and expansion ratios is crucial for model performance and hardware utilization of the CIM macro.

As an example of the effect of the compression ratio, Table I shows the accuracy of models with different compression ratios after being expanded to the same parameter count and fine-tuned. The baseline model has 9.218M parameters and an accuracy of 90.71%. The target for expansion is set at 50% of the baseline parameters, totaling 4.609M. The table shows that excessive compression (e.g., pruning ratio > 0.9) decreases performance due to a loss of important features. However, insufficient compression (e.g., pruning ratio < 0.1) limits the effectiveness of expansion and thus decreases performance as well. In addition to performance concerns, these ratios also lead to different macro usage due to the macro constraints.

TABLE I
MODEL COMPRESSION LIMIT

Parameters (Pruned) | Parameters (Expanded) | Accuracy
0.429M | 4.611M | 87.66%
0.501M | 4.607M | 88.94%
0.691M | 4.608M | 89.70%
1.014M | 4.605M | 90.70%
1.262M | 4.609M | 90.90%
1.993M | 4.609M | 90.90%
2.445M | 4.604M | 90.70%
2.848M | 4.610M | 90.76%
3.791M | 4.607M | 90.62%
4.049M | 4.610M | 90.32%

Table II shows the accuracy differences after expansion and fine-tuning for models with varying macro utilization rates, obtained by a grid search on the parameters of the model morphing flow. In this table, the top two rows are the best and worst macro usage when λ = 5E−8, and the bottom two rows are the best and worst macro usage when λ = 3E−8. The baseline model has 9.218M parameters and an accuracy of 90.71%. The target for model expansion is set at 8192 bitlines and 256 wordlines, using the ADAM optimizer for both compression and fine-tuning. During the 150-epoch compression phase, the learning rate is 0.05, and λ is gradually increased from 0 over the first 100 epochs before being fixed for the last 50 epochs. Compressed models with the highest and lowest macro usage are compared. After expansion, models are fine-tuned for 300 epochs at a learning rate of 0.01.

TABLE II
RESULTS OF DIFFERENT CIM MACRO USAGE MODELS FOR THE VGG-9 MODEL ON CIFAR-10

Parameters (Pruned) | Parameters (Expanded) | Macro Usage | Accuracy
1.154M | 1.960M | 93.46% | 91.16%
1.203M | 1.867M | 88.53% | 90.97%
1.255M | 1.929M | 92.00% | 91.01%
1.413M | 1.833M | 87.41% | 90.88%

Table I shows that model performance declines when the compression ratio falls below a certain threshold, e.g., 0.1 in Table I. Below this threshold, it is crucial to select a model that retains feature representation rather than focusing solely on CIM macro utilization after expansion. Thus, if the target macro size is less than 0.1 times the baseline model's parameter count, it is better to choose the model with higher accuracy during compression. In contrast, if the target macro size exceeds 0.1 times the baseline count, there is less risk of losing feature representation, making it acceptable to select the model with higher CIM macro utilization. This strategy can help achieve higher accuracy through resource reallocation.

C. End-to-End Performance

This subsection presents the main results for latency, accuracy, and compression across different models.

1) Settings: To show the effectiveness of the proposed approach, the model adaptation has been applied to different models, VGG9, VGG16, and ResNet18, as shown in Tables III to V, tailored to the constraints of four CIM macros, focusing on wordline and bitline limitations as well as quantization restrictions for weight cells and ADCs. These tables display accuracy based on CIFAR-10 test performance, where BLs denotes the number of bitlines in the CIM macro architecture (256 wordlines), and MACs represents the multiply-accumulate operations required for inference (equivalent to ADC activations). The baseline model features 4-bit quantized activations and was trained on CIFAR-10 for 2000 epochs. Four models are created under varying bitline constraints, each undergoing three morphing rounds: a 150-epoch compression phase and a 300-epoch fine-tuning phase, both using the ADAM optimizer (with learning rates of 0.05 and 0.01, respectively).

In the tables, Morphed Model Accuracy indicates the model's accuracy after compression. P1 Train shows the accuracy after batch normalization (BN) folding and 4-bit weight quantization, while P2 Train reflects the accuracy after further 5-bit partial sum quantization. The partial sum storage and latency columns assume model weights allocated in a CIM macro with 256 bitlines and 256 wordlines, each featuring 4-bit weight cells. Due to the limited wordlines, 5-bit partial sums are generated, necessitating additional storage, with Partial Sum Storage indicating the maximum space required for these sums. Load Weight Latency estimates the clock cycles needed to load weights; loading one full CIM macro requires 256 cycles. Lastly, Computing Latency denotes the clock cycles required for model inference. Convolution filters are divided into smaller chunks that convolve with the input channels sequentially, necessitating multiple passes through the wordlines. With only 64 ADCs available (4 bitlines per ADC), exceeding 64 simultaneous computations requires additional passes. The tables provide the clock cycles needed for a CIM macro to perform model inference.

2) Results: Tables III to V present the results for VGG9, VGG16, and ResNet18 after model morphing and weight adaptation, respectively. VGG9 comprises 8 convolutional layers and 1 fully connected layer. VGG16 features 13 convolutional layers and 1 fully connected layer. ResNet18 has 17 convolutional layers and 1 fully connected layer. For simplicity, only the convolutional layers are accelerated by the CIM macros.

The model morphing results indicate that for models utilizing over 4096 bitlines (i.e., more parameters), reallocating resources improves accuracy (91.33% and 91.07% in VGG9, 92.98% and 92.66% in VGG16, and 92.17% in ResNet18) compared to the baseline, even with fewer bitlines and total MAC operations. This enhancement stems from pruning redundant filters and reallocating excess bitline resources to critical convolutional layers, resulting in more meaningful and efficient weight storage and operations within the CIM macro. The proposed morphing can also achieve high macro usage, up to 94.54%, with small accuracy loss under the CIM aware constraints. The macro usage for ResNet18 is lower compared to the VGG models due to its higher number of convolutional layers. Consequently, with a bitline limit of 4096, the model's accuracy declines slightly. When the limit is reduced to 512, macro usage drops to just 25% and accuracy decreases further, resulting in lower accuracy than the VGG models.
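Before turning to the quantization and latency results, the bookkeeping behind the Macro Usage and Load Weight Latency columns can be approximated with the short sketch below: count the bitlines each layer occupies, round up to whole 256x256 macro loads, and charge 256 cycles (one per wordline) per macro load. This is my reading of the setup described above, not the authors' exact cost model, and the layer shapes in the example are hypothetical.

```python
import math

WORDLINES, BITLINES, ADCS = 256, 256, 64

def layer_bitlines(c_in, c_out, k):
    ch_per_bl = WORDLINES // (k * k)
    return math.ceil(c_in / ch_per_bl) * c_out

def macro_stats(layers):
    """layers: list of (c_in, c_out, kernel) tuples for the convolutional layers."""
    used_bls = sum(layer_bitlines(*l) for l in layers)
    macro_loads = math.ceil(used_bls / BITLINES)     # how many full-macro loads are needed
    usage = used_bls / (macro_loads * BITLINES)      # fraction of allocated columns used
    load_latency = macro_loads * WORDLINES           # 256 cycles per macro load
    return used_bls, usage, load_latency

# Toy example (hypothetical layer shapes).
print(macro_stats([(3, 32, 3), (32, 64, 3), (64, 64, 3)]))
```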
Additionally, as the number of parameters is decreased, quantization significantly impacts accuracy, causing an extra 3.75% drop when the bitline limit is 512.

The proposed quantization (P1 Train and P2 Train in the tables) achieves low accuracy loss for bitline constraints above 4096. The quantization loss increases for smaller bitline constraints, which is reasonable since a small model size has low tolerance to quantization effects. The tables also show that the proposed partial sum quantization (P2 Train) introduces negligible loss compared to the weight quantization (P1 Train).

In the tables, the partial sum storage is reduced by model morphing in all but one case. With a bitline constraint of 8192, the partial sum storage for VGG16 is increased. This occurs because the additional bitlines from pruning are allocated to earlier layers, which are critical for accuracy. These layers require more partial sum storage as their feature maps have not yet undergone significant pooling.

The computing latency is reduced in all cases (by 26% to 86% for VGG9, 30% to 89% for VGG16, and 3% to 81% for ResNet18), which is proportional to the reduction of MACs due to the model morphing. The latency to reload weights due to the limited macro size is also reduced (by 79% to 99% for VGG9, 87% to 99% for VGG16, and 82% to 99% for ResNet18), a higher reduction ratio than that of the computing latency because of the CIM constraint. These ratios are proportional to the reduction of the parameters and the used BLs.

Figs. 12 and 13 illustrate the mapping of the VGG9 model, morphed under bitline constraints of 512 and 1024, onto a 256x256 CIM macro. Different colors in the figures represent different convolutional layers.

Fig. 12. Mapping convolution weights into a CIM macro (model: VGG9, BL constraint: 512)

Fig. 13. Mapping convolution weights into a CIM macro (model: VGG9, BL constraint: 1024)

D. Comparisons with Other Approaches

Table VI compares three model adaptation methods using a model with a 4096-bitline constraint. E-UPQ [1] employs mixed precision (8, 4, 2, 1, 0) for the weights, resulting in an average precision of around 1 due to extensive pruning. It uses a 16x16 operation unit (OU), activating 16 wordlines at a time, and achieves about 87% weight reduction. XPert [2] uses full floating-point operations in its baseline model, while its compressed model adopts mixed precision for activations and ADCs, averaging 4.0 and 5.4 bits, respectively, with weights fixed at 8 bits. It activates 64 wordlines simultaneously, reducing parameters by 68.41% with 92.46% accuracy.

Compared to the previous approaches, our method begins with 4-bit quantized activations and floating-point weights, achieving over 90% compression through morphing and quantization while maintaining comparable accuracy. This approach outperforms the other methods in three aspects:
1) Parallelism: By using 4-bit parallel input and activating 256 wordlines simultaneously, our method leverages ADC-aware training to handle the higher quantization errors from concurrent operations. This achieves up to 64x speedup compared to E-UPQ and 16x compared to XPert (see the sketch after this list).
2) CIM Macro Utilization: Our method achieves nearly 90% utilization in VGG9 and VGG16, and 78.77% in ResNet18 with a 4096-bitline constraint, compared to just 13% in E-UPQ. This is due to directly pruning inefficient weights instead of storing them, making more efficient use of the CIM macro space.
3) Compression Rate: Through pruning and resource reallocation, our method improves accuracy and compensates for quantization-induced losses, achieving over 90% model compression.
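The sketch referenced in item 1 above: one way to arrive at the quoted 64x and 16x figures is to scale throughput by the ratio of concurrently activated wordlines and by a factor of 4 for applying all four input bits in parallel rather than bit-serially. The bit-serial assumption for the baselines is mine; the paper does not spell this arithmetic out.

```python
# Assumed speedup model: throughput ~ (activated wordlines) x (input bits per cycle).
OUR_WORDLINES, OUR_INPUT_BITS_PER_CYCLE = 256, 4

def speedup(their_wordlines, their_input_bits_per_cycle=1):
    return (OUR_WORDLINES / their_wordlines) * \
           (OUR_INPUT_BITS_PER_CYCLE / their_input_bits_per_cycle)

print(speedup(16))   # vs. E-UPQ (16 activated wordlines)  -> 64.0
print(speedup(64))   # vs. XPert (64 activated wordlines)  -> 16.0
```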
TABLE III
COMPREHENSIVE RESULTS FOR VGG9 WITH DIFFERENT BL CONSTRAINTS

BL Constraint | Param (M) | BLs | MACs | Macro Usage | Morphed Model Acc. | P1 Train | P2 Train | Partial Sum Storage | Load Weight Latency | Computing Latency
Baseline | 9.218 | 38592 | 724992 | - | 90.71% | - | - | 163840 | 38656 | 14696
8192 | 1.971 (-79%) | 8186 (-79%) | 489248 (-33%) | 93.98% | 91.33% (+0.62%) | 90.01% | 89.83% | 133056 (-19%) | 8192 (-79%) | 10928 (-26%)
4096 | 0.924 (-90%) | 3907 (-90%) | 358888 (-50%) | 88.12% | 91.07% (+0.36%) | 89.77% | 89.17% | 107520 (-34%) | 4096 (-89%) | 9116 (-38%)
1024 | 0.210 (-98%) | 1024 (-97%) | 123792 (-83%) | 80.11% | 89.24% (-1.47%) | 87.58% | 87.39% | 41984 (-74%) | 1024 (-97%) | 3020 (-80%)
512 | 0.098 (-99%) | 511 (-99%) | 85756 (-88%) | 74.77% | 87.71% (-3.00%) | 85.47% | 85.40% | 39936 (-76%) | 512 (-99%) | 2108 (-86%)

TABLE IV
COMPREHENSIVE RESULTS FOR VGG16 WITH DIFFERENT BL CONSTRAINTS

BL Constraint | Param (M) | BLs | MACs | Macro Usage | Morphed Model Acc. | P1 Train | P2 Train | Partial Sum Storage | Load Weight Latency | Computing Latency
Baseline | 14.710 | 61440 | 1443840 | - | 92.02% | - | - | 196608 | 61440 | 31300
8192 | 1.983 (-87%) | 8148 (-87%) | 986784 (-32%) | 94.54% | 92.98% (+0.96%) | 92.73% | 92.25% | 245760 (+25%) | 8192 (-87%) | 21996 (-30%)
4096 | 0.952 (-94%) | 3963 (-94%) | 622032 (-57%) | 90.83% | 92.66% (+0.64%) | 92.49% | 91.88% | 174080 (-11%) | 4096 (-93%) | 16192 (-48%)
1024 | 0.203 (-99%) | 1021 (-98%) | 259420 (-82%) | 77.58% | 89.96% (-2.06%) | 88.66% | 88.55% | 106496 (-46%) | 1024 (-98%) | 6028 (-81%)
512 | 0.088 (-99%) | 510 (-99%) | 117408 (-92%) | 67.07% | 86.45% (-5.57%) | 83.03% | 84.50% | 35840 (-82%) | 512 (-99%) | 3532 (-89%)

TABLE V
COMPREHENSIVE RESULTS FOR RESNET18 WITH DIFFERENT BL CONSTRAINTS

BL Constraint | Param (M) | BLs | MACs | Macro Usage | Morphed Model Acc. | P1 Train | P2 Train | Partial Sum Storage | Load Weight Latency | Computing Latency
Baseline | 10.987 | 46400 | 690176 | - | 91.44% | - | - | 65536 | 46592 | 16860
8192 | 1.804 (-84%) | 8188 (-82%) | 674344 (-2%) | 86.01% | 92.17% (+0.73%) | 91.34% | 90.99% | 97280 (+48%) | 8192 (-82%) | 16296 (-3%)
4096 | 0.829 (-92%) | 4088 (-91%) | 411848 (-40%) | 78.77% | 91.37% (-0.07%) | 90.40% | 90.21% | 66560 (+2%) | 4096 (-91%) | 12092 (-28%)
1024 | 0.132 (-99%) | 997 (-98%) | 145888 (-79%) | 50.71% | 86.16% (-5.28%) | 84.37% | 84.68% | 57344 (-13%) | 1024 (-98%) | 3940 (-77%)
512 | 0.033 (-99.6%) | 512 (-99%) | 79760 (-88%) | 25.37% | 81.01% (-10.43%) | 78.74% | 77.26% | 40960 (-38%) | 512 (-99%) | 3128 (-81%)

TABLE VI
COMPARISON TABLE

                            | E-UPQ [1]     | E-UPQ [1]     | XPert [2]      | This work      | This work      | This work
Model                       | ResNet18      | ResNet20      | VGG16          | VGG9           | VGG16          | ResNet18
Dataset                     | CIFAR-100     | CIFAR-10      | CIFAR-10       | CIFAR-10       | CIFAR-10       | CIFAR-10
Baseline accuracy           | 74.4%         | 91.3%         | 94.0%          | 90.7%          | 92.0%          | 91.4%
Compressed accuracy         | 73.2% (-1.2%) | 90.5% (-0.8%) | 92.46% (-1.5%) | 89.17% (-1.5%) | 91.88% (-0.8%) | 90.21% (-1.23%)
Bit (Weight/Activation/ADC) | 1.0/8.0/4.0   | 1.1/8.0/4.0   | 8.0/4.0/5.4    | 4.0/4.0/5.0    | 4.0/4.0/5.0    | 4.0/4.0/5.0
Memory cell                 | 1 bit         | 1 bit         | 1 bit          | 4 bits         | 4 bits         | 4 bits
Compression ratio           | -87.50%       | -86.30%       | -68.41%        | -89.98%        | -93.53%        | -92.45%
Macro usage                 | 12.50%        | 13.70%        | -              | 88.12%         | 90.83%         | 78.77%
Activated wordlines         | 16            | 16            | 64             | 256            | 256            | 256
Pruning                     | ✓             | ✓             | ×              | ✓              | ✓              | ✓
Adjustable after pruning    | ×             | ×             | ×              | ✓              | ✓              | ✓
ADC aware training          | ×             | ×             | ×              | ✓              | ✓              | ✓

IV. CONCLUSION

CIM brings the benefits of highly parallel computation and low power consumption, but it suffers from throughput and performance bottlenecks due to the extra weight loading required by the limited memory array size and the ADC quantization errors of the partial sums. Addressing this problem, this paper has presented a two-stage process to adapt models to CIM constraints. The first stage compresses and reallocates the weights to maximize macro utilization and minimize weight loading while retaining accuracy under the CIM array size constraints. The second stage quantizes the model with a learned quantization step size and ADC aware training to reduce the impact of quantization errors on partial sum accumulation. Compared to the previous approaches, the presented method achieves higher macro utilization, up to 90%, a higher compression ratio, up to 93%, and more activated wordlines, up to 256, with lower accuracy loss.

REFERENCES

[1] C.-Y. Chang, K.-C. Chou, Y.-C. Chuang, and A.-Y. Wu, "E-UPQ: Energy-aware unified pruning-quantization framework for CIM architecture," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 1, pp. 21–32, 2023.
[2] A. Moitra, A. Bhattacharjee, Y. Kim, and P. Panda, "XPert: Peripheral circuit & neural architecture co-search for area and energy-efficient xbar-based computing," in 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6.
[3] X.-J. Chen and C.-L. Yang, "CIMNet: Joint search for neural network and computing-in-memory architecture," IEEE Micro, pp. 1–12, 2024.
[4] C. Sakr and N. R. Shanbhag, "Signal processing methods to enhance the energy efficiency of in-memory computing architectures," IEEE Transactions on Signal Processing, vol. 69, pp. 6462–6472, 2021.
[5] A. B. Sundar, J. Viraraghavan, and B. Vijayakumar, "Input-conditioned quantisation for ENOB improvement in CIM ADC columns targeting large-length partial sums," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 71, no. 6, pp. 2971–2975, 2024.
[6] J. Bai, W. Xue, Y. Fan, S. Sun, and W. Kang, "Partial sum quantization for computing-in-memory-based neural network accelerator," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 8, pp. 3049–3053, 2023.
[7] Y. Kim, H. Kim, and J.-J. Kim, "Extreme partial-sum quantization for analog computing-in-memory neural network accelerators," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 18, no. 4, pp. 1–19, 2022.
[8] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1586–1595.
[9] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," arXiv preprint arXiv:1902.08153, 2019.

Ming-Han Lin received the M.S. degree in electronics engineering from National Yang Ming Chiao Tung University, Hsinchu, Taiwan, in 2024. He is currently working at NovaTek, Hsinchu, Taiwan. His research interests include deep learning and computing-in-memory.

Tian-Sheuan Chang (S'93–M'06–SM'07) received the B.S., M.S., and Ph.D. degrees in electronic engineering from National Chiao-Tung University (NCTU), Hsinchu, Taiwan, in 1993, 1995, and 1999, respectively.
From 2000 to 2004, he was a Deputy Manager with Global Unichip Corporation, Hsinchu, Taiwan. In 2004, he joined the Department of Electronics Engineering, NCTU (renamed National Yang Ming Chiao Tung University (NYCU) in 2021), where he is currently a Professor. In 2009, he was a visiting scholar at IMEC, Belgium. His current research interests include system-on-a-chip design, VLSI signal processing, and computer architecture.
Dr. Chang received the Excellent Young Electrical Engineer Award from the Chinese Institute of Electrical Engineering in 2007 and the Outstanding Young Scholar Award from the Taiwan IC Design Society in 2010. He has been actively involved in many international conferences as an organizing committee or technical program committee member.
