
Future Generation Computer Systems 169 (2025) 107782


PIM-IoT: Enabling hierarchical, heterogeneous, and agile Processing-in-Memory in IoT systems

Kan Zhong a,∗, Qiao Li b,∗, Ao Ren c, Yujuan Tan c, Xianzhang Chen c, Linbo Long d, Duo Liu a

a School of Big Data and Software Engineering, Chongqing University, No. 174 Shazhengjie, Shapingba District, Chongqing, 400044, China
b School of Computer Science and Engineering, University of Electronic Science and Technology of China, No. 2006 Xiyuan Ave, High-tech Zone, Chengdu, 611731, China
c College of Computer Science, Chongqing University, No. 174 Shazhengjie, Shapingba District, Chongqing, 400044, China
d School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing, 400065, China

ARTICLE INFO

Keywords:
Processing in memory
Internet of Things
3-tier IoT system
IoT application modeling

ABSTRACT

The Internet of Things (IoT) is an emerging concept that senses the physical world by connecting various ''things'' and objects to the Internet. Conventional cloud-based IoT systems are unlikely to keep up with the diverse needs of IoT applications and have some issues, such as privacy and latency. Edge computing based IoT systems solve these issues by placing data processing and inference tasks near the data source. However, due to the increasing complexity of IoT applications, performing data processing and inference tasks in edge computing based IoT systems can lead to high energy consumption and latency.

Processing-in-Memory (PIM) is a promising solution to reduce the energy consumption of data processing and inference tasks by closely integrating computational logic with memory devices. Therefore, in this paper, we propose PIM-IoT, a PIM-architecture-enabled IoT system that reduces energy consumption. To accommodate various data processing tasks, we architect PIM-IoT as a hierarchical system that consists of 3 tiers: the sensing tier, gateway tier, and edge computing tier. We first analyze the dataflow of typical IoT applications and map tasks to different tiers. To handle the data processing and inference tasks effectively in each tier, we then propose hierarchical, heterogeneous, and collaborative PIM architectures for each tier. Finally, we show how the multiple tiers can be co-optimized under latency and power constraints. To our knowledge, this is the first work to explore novel PIM architectures in IoT systems. Detailed analysis and experimental results show that PIM-IoT can achieve 5.6x performance improvement and 6x energy consumption reduction for IoT applications.

1. Introduction

The Internet of Things (IoT) enables the ability to acquire information from the physical world using various sensors, including cameras, sensor-enabled devices, wearables, etc. The conventional cloud-based approach is incapable of handling the diverse needs of IoT applications and manifests several disadvantages, including privacy, security, latency, and network bandwidth [1–3]. For example, cloud-based animal detection and recognition is impractical for automatic wildlife monitoring in remote areas, where a high-bandwidth and low-latency connection to the cloud is usually unavailable [4]. Edge computing, on the other hand, is a promising solution for IoT systems to process and analyze sensing data near the data source, instead of transmitting them to the centralized cloud [3,5–7]. Thus, this paper targets edge computing based IoT systems and proposes a novel PIM based co-optimization framework to reduce the energy consumption and latency of IoT applications.

With large-scale heterogeneous sensors being connected, IoT applications become increasingly complicated. Different levels of data

∗ Corresponding authors.
E-mail addresses: [email protected] (K. Zhong), [email protected] (Q. Li), [email protected] (A. Ren), [email protected] (Y. Tan),
[email protected] (X. Chen), [email protected] (L. Long), [email protected] (D. Liu).

https://doi.org/10.1016/j.future.2025.107782
Received 29 July 2024; Received in revised form 13 February 2025; Accepted 18 February 2025
Available online 28 February 2025
0167-739X/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Fig. 1. Data movement energy in typical IoT tasks.

Fig. 2. Fraction of bitwise operations in typical encryption and compression algorithms.

processing tasks, such as data encryption, compression, fusion, and aggregation, are leveraged in IoT applications to improve security and eliminate redundancy. Emerging deep learning algorithms, such as convolutional neural networks (CNNs) and deep neural networks (DNNs), are also likely to be adopted to perform robust and reliable inference tasks in IoT applications [6,8–10]. These diverse IoT tasks are often distributed through the entire dataflow of IoT applications. Thus, a hierarchical IoT system with multiple tiers is required to accommodate these data processing and inference tasks, which can be mapped to specific tiers. However, performing memory-/computing-intensive data processing and inference tasks near the edge imposes a grand challenge on energy consumption, as most IoT devices are powered by a battery or energy harvester. In particular, the data movement between the main memory system and the computation unit (i.e., the CPU) is a major energy contributor in IoT systems. Prior work [11] suggests that 62.7% of the total system energy is spent on data movement in wearable devices. In our evaluation, the energy consumed by data movement in the conventional architecture for typical IoT tasks accounts for 40%–60% of the total energy consumption, as shown in Fig. 1.

Processing-in-memory (PIM) is a promising solution to achieve energy-efficient computing by closely integrating computational logic with memory. Since it is reported that reading 64-byte data from off-chip memory to the CPU consumes two orders of magnitude more energy than a floating-point operation [12], PIM can significantly reduce the energy consumption and data access latency of tasks by alleviating or even eliminating the data movements between computational logic and off-chip memory. This is especially beneficial to memory-intensive tasks, where data movements between computational logic and memory are the major performance and energy bottlenecks [11,13]. This motivates us to explore a novel PIM design and runtime management framework to facilitate energy-efficient and high-performance IoT systems.

To this end, we propose PIM-IoT, a PIM supported hierarchical IoT architecture. PIM-IoT consists of 3 tiers, and tasks are mapped to different tiers: the sensing tier is responsible for data collection and performs data encryption and compression; the gateway tier provides connections to sensor nodes and performs encoding, data fusion, and aggregation; the edge computing tier enables edge computing capability and performs various inference tasks. To effectively map hardware resources to different tasks across multiple tiers, this paper proposes hierarchical, heterogeneous, and agile PIM architectures that can leverage workload characteristics in each tier.

In the sensing tier, sensor nodes frequently perform data compression and encryption operations before transmitting data to gateway nodes. Both compression and encryption operations heavily rely on bitwise operations like AND, OR, XOR, and NOT, as shown in Fig. 2. In particular, the fraction of bitwise operations in SHA-256 is more than 50%. To effectively perform energy-efficient data encryption and compression in IoT sensor nodes, we therefore explore Bitwise-PIM, a PIM architecture with adaptive write performance (i.e., write delay, D_w) to conduct in-memory bitwise operations. To reduce the energy consumption of bitwise operations in Bitwise-PIM, we propose a novel performance-guaranteed voltage scaling technique for identifying the optimal write voltage under a tunable write delay constraint.

The gateway tier usually needs to execute multiple pipelined data level parallelism (DLP) tasks. In this study, we explore CGRA-PIM, which enables PIM with a flexible coarse-grained reconfigurable architecture that is capable of providing both pipeline and serial execution modes (i.e., execution mode, E_m) as well as exploiting various degrees of parallelism (i.e., number of PEs, N_p) for IoT applications.

To explore neural network based inference tasks in the edge computing tier, we leverage the approximate computation capability of memristor crossbars and propose Approx-PIM. Since the analog-to-digital converter (ADC) and digital-to-analog converter (DAC) are the major energy consumption contributors in Approx-PIM, we propose a novel Output-Input driver and an analog activation function that are adaptive to different quantization precisions (i.e., precision mode, P_w) to reduce the usage of DACs/ADCs.

To our knowledge, this is the first work to explore PIM architectures in IoT systems. Experimental results show that PIM-IoT can achieve 8x performance improvement and 5.4x energy consumption reduction for IoT applications. The rest of this paper is organized as follows. Section 2 first introduces the background of PIM and IoT systems, then presents the motivation. Section 3 discusses the proposed 3-tier PIM-IoT system and the dataflow of typical IoT applications. Section 4 proposes the hierarchical, heterogeneous, and agile PIM architectures for each IoT tier. Section 5 evaluates the proposed PIM-IoT system. Section 6 summarizes the related work and Section 7 concludes the paper.

2. Background and motivation

In this section, we first introduce the edge computing enabled IoT system and then present the background of processing in memory.

2.1. Edge computing based IoT system

IoT applications usually produce a large amount of data. For example, in surveillance applications, image and video are collected for object recognition and anomaly detection; in healthcare applications, physiological parameters, such as blood pressure and pulse, are sensed for early detection of clinical deterioration. Conventionally, leveraging high-performance servers, IoT data is processed and analyzed in the cloud. However, cloud-based IoT systems have several critical pitfalls: (1) uploading a large amount of IoT data can lead to high energy consumption; (2) high bandwidth and stable connections are not always available and are even impractical in some scenarios, making time-critical applications exhibit long response times; (3) sending personal and sensitive data is at risk of privacy leakage. Therefore, IoT systems tend to be equipped with edge computing capability, which places the application closer to the end users and avoids data uploading by processing data near the source [3,5–7].

2.2. Processing-in-Memory (PIM)

The data movements between processing units and memory become the major performance and energy bottleneck for memory-intensive IoT tasks, such as image processing and deep neural networks. It is reported that transferring data from off-chip memory through the cache hierarchy to the CPU consumes two orders of magnitude more energy than a floating-point operation [12]. One of the promising solutions to tackle this issue is to add architecture support to move


memory-intensive computations closer to memory. This approach is generally referred to as Processing-in-Memory (PIM) or Near Data Computing (NDC). In this paper, we refer to this approach as PIM. The high-level idea behind the PIM concept is to have computational units closely integrated with memory such that data can be transferred from memory to these units at much lower overhead (e.g., lower latency and energy consumption). To realize PIM, many prior techniques [14–18] have been proposed to integrate dedicated computation logic into the memory chip to perform specific data manipulation. In this paper, we propose hierarchical, heterogeneous, and agile PIM architectures to reduce power consumption and improve the performance of IoT systems.

2.3. Motivations

In this paper, we propose PIM-IoT based on the following motivations.

M1. IoT devices have a limited power budget. IoT devices are usually powered by a battery or energy harvester, but are increasingly being used to perform memory-/computing-intensive tasks. Therefore, energy-efficient operation is critical for these energy-constrained IoT devices. However, in the von Neumann architecture, the movement of data between memory and compute units imposes significant energy overheads. We have evaluated the energy consumption of data movement in the conventional architecture for typical IoT tasks. The results are shown in Fig. 1. The energy consumed by the movement of data accounts for 40%–60% of the total energy consumption. Therefore, reducing the energy consumption of data movement in IoT devices becomes a critical mission. Since PIM has been shown to have high potential in reducing the data movement overhead, we choose to design PIM architectures for each IoT tier to achieve energy-efficient computing.

M2. Many IoT applications are time-critical. Many IoT scenarios, like industrial automation, healthcare monitoring, and autonomous vehicles, demand real-time data analysis and response, meaning actions need to be taken almost instantly based on incoming sensor data. For example, autonomous vehicles need to make immediate driving decisions in 1–10 ms; healthcare monitoring requires near real-time (100–500 ms) monitoring of patients' vital signs and processing of data to provide timely diagnosis and emergency response. However, the general-purpose processing unit found in most IoT devices not only has limited computing power but is also not suitable for complex data-intensive IoT tasks, posing a challenge for time-critical IoT applications.

3. An analysis of IoT system

In this section, we first introduce the architecture overview of the 3-tier IoT system, and then present the IoT model. Finally, we demonstrate two typical IoT applications, which will be used as benchmarks in the evaluation.

Fig. 3. A typical 3-tier IoT system.

3.1. 3-Tier IoT system

IoT systems provide an infrastructure to connect various things and extract useful information through data analytics. Data perception, transmission, and analysis become the basic capabilities of IoT systems. To handle various sensors and process sensor data efficiently, IoT systems should be carefully designed and partitioned.

Fig. 3 shows the targeted IoT system hierarchy. Based on the demands and characteristics of IoT workloads, we partition the IoT system into 3 tiers targeting different functionalities: sensing tier, gateway tier, and edge computing tier. Table 1 summarizes the characteristics of these tiers. The sensing tier is responsible for gathering information from the physical world in real time by using various sensors, such as cameras and embedded sensors. For instance, cameras send RAW images to the IoT gateway through a wired connection, while embedded sensors send sensor data to the IoT gateway through Bluetooth. Embedded sensors usually feature a microcontroller unit (MCU) for data sensing and compression. The gateway tier uses IoT gateways to connect various sensors and collect sensor data. Image/video encoding and data aggregation, such as image data fusion, are the common tasks for IoT gateways. To enable the edge computing capability, the IoT edge computing node is introduced as the third tier of the IoT system hierarchy. The edge computing node performs real-time data analysis and allows fast response if necessary. Deep learning techniques are also integrated to perform robust and reliable inference tasks, such as pattern and object recognition.

3.2. IoT application modeling

We have created an IoT model to analyze the latency of dataflow through the three IoT tiers. IoT applications are typically composed as dataflow graphs, with vertices representing data processing tasks and edges denoting streams of data blocks that can be passed between them [22]. Fig. 4 shows the IoT data processing dataflow. In the figure, c_i and p_i respectively denote the input and output data size of task i. Data processing tasks can be deployed in different tiers based on the main functions of each tier. For example, compression and encryption tasks are deployed in the sensor tier, while data aggregation tasks are deployed in the gateway tier. Data blocks from/to the stream are consumed/produced by the tasks. The ratio of the output block size to the input data block size of a task is called the selective ratio, expressed as σ = p_i / c_i.

Fig. 4. IoT application dataflow.

Let t_n^k denote the execution time of task n when consuming one basic data block at the kth tier. For each basic data block, the processing time T_n^k of task n at the kth tier is:

T_n^k = σ_1^k × σ_2^k × ⋯ × σ_{n-1}^k × t_n^k = t_n^k × ∏_{i=1}^{n-1} σ_i^k . (1)

For an IoT application that has N processing tasks at the kth tier, the total processing delay can be determined by:

T_delay^k = t_1^k + Σ_{n=2}^{N} [ t_n^k × ∏_{i=1}^{n-1} σ_i^k ] . (2)

3.3. Dataflow of typical IoT application

Image Classification Applications. Fig. 5 shows a detailed dataflow of image classification, which is widely used in IoT applications, such as wildlife monitoring and video surveillance. To achieve better quality, images are often fused and enhanced by using both an RGB camera and an IR camera, which respectively capture visible images


Table 1. Characteristics of different tiers.

| Features   | IoT sensing tier                                          | IoT gateway tier                                                          | IoT edge computing tier                                             |
| Function   | Data collection, encryption, compression, communication   | Image/video compression and encoding, sensor data fusion or aggregation   | Data analytics, deep learning based pattern and object recognition  |
| Typ. task  | AES, SHA, LZW                                             | JPEG encoding, data fusion/aggregation                                    | CNN-based image classification                                      |
| Connection | Wired and BLE (5–35 Kbps)                                 | WiFi (54 Mbps–1 Gbps)                                                     | Ethernet (to Cloud) (~1 Gbps)                                       |
| Processor  | MCU, 10–200 MHz                                           | ARMv7/v8, 700 MHz–2.4 GHz                                                 | ARMv7/v8, 1.5–2.8 GHz + GPU                                         |
| Memory     | ~100s KB SRAM                                             | 256 MB–1 GB DRAM                                                          | 1–8 GB DRAM                                                         |
| Example    | Raspberry Pi Pico @ 133 MHz [19]                          | Raspberry Pi 4B @ 1.5 GHz [20]                                            | Jetson Orin Nano, 512 CUDA cores [21]                               |
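The tier delay model of Eqs. (1)–(2) is straightforward to exercise in code. Below is a minimal sketch; the task times and selective ratios are hypothetical and not taken from the paper's benchmarks.

```python
def tier_delay(t, sigma):
    """Total per-block processing delay of N tasks in one tier (Eq. (2)).

    t:     [t_1^k, ..., t_N^k], execution time of each task on one basic block
    sigma: [sigma_1^k, ..., sigma_{N-1}^k], selective ratios p_i / c_i
    """
    delay, shrink = t[0], 1.0
    for t_n, s in zip(t[1:], sigma):
        shrink *= s               # prod_{i=1}^{n-1} sigma_i^k
        delay += t_n * shrink     # T_n^k of Eq. (1)
    return delay

# Hypothetical gateway tier with 3 tasks, sigma_1 = 0.5, sigma_2 = 0.8:
# T_delay = 2.0 + 1.0 * 0.5 + 4.0 * (0.5 * 0.8), i.e. about 4.1
print(tier_delay([2.0, 1.0, 4.0], [0.5, 0.8]))
```

Because each σ < 1 shrinks every downstream block, tasks placed later in the dataflow contribute progressively less to the tier's total delay, which is why the model rewards pushing data-reducing tasks (compression, filtering) toward the front of the pipeline.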

Fig. 5. Dataflow of image classification applications.

and thermal images. The IoT gateway performs image fusion by combining the visible image and thermal image, and then encodes the combined image. Finally, the combined image is sent to the IoT edge computing node, which leverages CNNs to perform image classification. As shown, discrete wavelet transform (DWT) based image fusion is applied to the images collected by the RGB camera and IR camera. To reduce data transmission, the merged image is encoded as a JPEG file and then sent to the edge computing node for image classification. In Fig. 5, each data processing task has a specific selective ratio (σ), and the image size is fixed. Thus, the total data processing delay in the IoT gateway tier can be calculated based on Eq. (2) using the selective ratio.

Wearable Healthcare Monitoring. Wearable health monitoring enables the early detection of key health risk factors to reduce the chance of serious or chronic illness. Wearable devices (i.e., sensor nodes) placed at the wrist, waist, chest, head, etc., can continuously collect activity, physiological, and biomechanical parameters, such as electroencephalography (EEG), electrooculography (EOG), heart rate, and blood pressure.

Fig. 6. Dataflow of wearable health monitoring.

Fig. 6 illustrates a typical dataflow of healthcare monitoring. To reduce data transmission or improve security, health parameters are compressed or encrypted (σ = 0.4) before being sent to the IoT gateway. In the IoT gateway, filtering algorithms (σ = 0.8) are applied to these parameters to eliminate abnormal values, and then the same parameters from multiple sensors are averaged (σ = 0.5) to improve accuracy. For example, the average physiological temperature readings could be taken from multiple body parts to find a single best value. For some parameters, only the maximum or minimum value is of concern, leading to a very small σ. Different data fusion algorithms can be adopted to extract the concerned features or to integrate the raw parameters from multiple sensors. In this application, the IoT edge computing node performs long-term monitoring diagnosis using various models.

4. PIM to rescue: PIM-IoT

We solve the energy issue of IoT systems with hierarchical, heterogeneous, and agile PIM architectures. In this section, we first introduce the overview of the PIM architectures and then present the detailed PIM architecture design.

4.1. PIM-IoT overview

Fig. 7. Overview of PIM enabled IoT sensor node.

IoT sensor tier. Due to its zero leakage power and low access latency, spin-torque transfer magnetic RAM (STT-RAM) shows a high potential in reducing the energy consumption of IoT sensor nodes [23–25]. Leveraging the resistive nature of STT-RAM, we explore Bitwise-PIM to accelerate bitwise operations, as they are widely used in data encryption and compression. To reduce the write energy in Bitwise-PIM, we propose a novel performance-guaranteed voltage scaling technique to identify the optimal voltage under a tunable STT-RAM write delay (i.e., write delay, D_w) constraint.

IoT gateway tier. IoT gateway tasks, such as image/video encoding, data fusion, and aggregation, are usually memory-intensive and exhibit massive DLP. General-purpose processors pay a big price to reduce power for these tasks. Coarse-grained reconfigurable architectures (CGRAs), which contain multiple processing units, have been shown to bring significant performance and energy-efficiency benefits by exploiting DLP in IoT gateway tasks [26]. We therefore exploit CGRA-PIM, a flexible CGRA-based PIM architecture with multiple PEs. To make CGRA-PIM flexible to different performance and power requirements, we make CGRA-PIM capable of providing pipelined and serial execution modes (i.e., execution mode, E_m) as well as tunable with different numbers of execution PEs (i.e., number of PEs, N_p).

IoT edge computing tier. Leveraging the error resilience of neural networks, we explore Approx-PIM, which applies approximate computing to CNNs. We configure Approx-PIM to support two quantization precisions (i.e., precision mode, P_w): 4-bit and 8-bit. The DACs and ADCs in Approx-PIM exhibit high energy consumption, the power of which increases exponentially with resolution and frequency. Existing works, such as PipeLayer [16], eliminate ADCs/DACs by using pure analog signals. However, this approach sacrifices flexibility and complicates the implementation of the pooling algorithm. To solve this challenge, we propose a novel Output-Input driver and an analog activation function unit that are adaptive to different quantization precisions to reduce the energy consumption by eliminating ADCs/DACs between two convolution layers.

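The per-tier knobs introduced above (write delay D_w for Bitwise-PIM, execution mode E_m and PE count N_p for CGRA-PIM, precision mode P_w for Approx-PIM) can be summarized as one configuration record per tier. A toy sketch only: the knob names follow the text, while the concrete values and the `TierConfig` structure are invented for illustration and are not the paper's actual runtime interface.

```python
from dataclasses import dataclass

@dataclass
class TierConfig:
    tier: str    # "sensing" | "gateway" | "edge"
    pim: str     # PIM architecture deployed in that tier
    knobs: dict  # tunable parameters exposed by the architecture

# Illustrative values only; the knob names follow the text.
PIM_IOT = [
    TierConfig("sensing", "Bitwise-PIM", {"D_w_ns": 10}),               # write delay
    TierConfig("gateway", "CGRA-PIM", {"E_m": "pipeline", "N_p": 32}),  # exec mode, #PEs
    TierConfig("edge",    "Approx-PIM", {"P_w_bits": 8}),               # 4- or 8-bit
]
```

The point of the record is that the co-optimization in later sections tunes exactly these per-tier parameters under shared latency and power constraints.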

4.2. Bitwise-PIM for sensing tier

4.2.1. STT-RAM based in-memory bitwise operations

Fig. 7 shows the overview of Bitwise-PIM in the IoT sensing tier. STT-RAM is adopted as the main memory of the MCU. A typical STT-RAM cell is made up of a magnetic-tunnel junction (MTJ) and an access transistor. The resistance of the MTJ (i.e., R_P or R_AP) is used to represent the binary value stored in the cell. To read the value from the cell, a small negative voltage is applied between BL and SL, and WL is enabled. The amount of current, which depends on the resistance of the MTJ, flowing through the cell is sensed by a sense amplifier and compared to a reference current to determine the logic state of the cell.

Fig. 8. In-memory bitwise operation design concept. (a) Enabling two rows simultaneously. (b) Determining sensing currents. (c) Current sensing circuitry design.

Fig. 8 shows the design concept of in-memory bitwise operations. Since the STT-RAM read voltage (i.e., V_read) is constant, the sense amplifier leverages the current to determine the logic value.

OR operation: To perform an OR operation, the sense amplifier should output '0' when both cells' resistances are R_AP (high resistance); otherwise, it should output '1'. To achieve this, we choose the reference current (i.e., I_ref_or) to be a value between 2·I_AP and I_AP + I_P, as shown in Fig. 8(b).

AND operation: Similarly, to realize an AND operation, the sense amplifier outputs '1' only when both cells' resistances are R_P, and thus the current flowing through SL should be 2·I_P. As shown in Fig. 8(b), we therefore set the reference current (i.e., I_ref_and) to a value between I_AP + I_P and 2·I_P.

XOR operation: The XOR operation is realized by combining OR and AND operations, since it can be expressed as A ⊕ B = (A + B) ⋅ ¬(A ⋅ B). Therefore, one extra sense amplifier (i.e., SA2 in Fig. 8(c)) is required to perform the AND operation simultaneously, and a CMOS-based AND logic gate is required to produce the final result.

INV operation: The INV operation is realized by simply inverting the logic value read from the cell, which is controlled by CTRL in Fig. 8(c).

The detailed sensing circuitry design is shown in Fig. 8(c). It consists of two sense amplifiers, two multiplexers, and one CMOS-based AND logic gate. For OR, AND, and INV operations, only SA1 is used and the output of MUX2 is always 1. For the XOR operation, both SA1 and SA2 are used, and the result of the AND operation in SA2 is inverted before output.

4.2.2. Performance guaranteed voltage scaling

In STT-RAM, the cell switching time is related to the write voltage (i.e., V_DD in Fig. 7). A higher voltage can reduce the STT-RAM switching time, but may lead to higher energy consumption. Therefore, we propose a novel performance guaranteed voltage scaling method to find an optimal write energy under an STT-RAM write time constraint.

Let T_th denote the MTJ switching threshold time, which guarantees the write performance. The required MTJ write current in precession switching mode can be written as [27]:

I_C(T_th) = I_C0 + C ln(π/2θ) / T_th , (3)

where I_C0 is the switching threshold current that causes magnetization switching in a nanomagnetic device.

The current flowing through the access transistor, which equals I_C, can be modeled as [28]:

I_C = K × W × (V_GS − V_TH + λ_DIBL × V_DS), (4)

where V_TH is the transistor threshold voltage, λ_DIBL is the drain-induced barrier lowering (DIBL) coefficient, K is the intrinsic driving strength, which is set by the transistor technology, and W is the transistor width.

The required STT-RAM cell switching voltage (i.e., V_req) can be expressed as:

V_req = V_DS + I_C × R_MTJ = (I_C − K·W·(V_GS − V_TH)) / (K·W·λ_DIBL) + I_C × R_MTJ . (5)

According to [28], the optimal voltage that can achieve the minimum switching energy can be expressed as:

V_opt = V_C0 × [1 + √(1 − V_TH / V_C0)] , (6)

where V_C0 is the minimum supply voltage required to deliver the switching threshold current I_C0. To reduce the cell write energy while ensuring the write performance, we scale the STT-RAM cell write voltage (i.e., V_DD) to:

V_DD = max(V_req, V_opt). (7)

This means that if the voltage required to meet the switching time (i.e., V_req) is higher than V_opt, we set the STT-RAM cell write voltage (i.e., V_DD) to V_req to guarantee write performance; otherwise, the write voltage is set to V_opt to minimize the write energy.

4.3. CGRA-PIM for gateway tier

4.3.1. CGRA-PIM overview

Fig. 9(a) illustrates the overview of the proposed CGRA-PIM, which is a coarse-grained reconfigurable PIM architecture. CGRA-PIM consists of 32 PEs. Note that the number of PEs can be scaled up according to the number of tasks in the IoT gateway. We stack CGRA-PIM on top of conventional main memory to achieve ''near-memory'' computation. To hide the access latency of main memory, we feature CGRA-PIM with an on-chip cache. The on-chip cache is built with multiple SRAM banks matching the number of PEs. The connections between PEs and SRAM banks go through a switch box, which redirects a PE to a certain SRAM bank. Two nearby PEs are connected. Each PE contains six function units (denoted by 'FU') that perform computation and two memory units (denoted by 'M') that are responsible for load/store instructions. As shown, each FU is directly connected to a memory unit in order to access the SRAM banks effectively. Fig. 9(b) shows the details of an FU. The register file (RF) in each PE is used to store temporary values.

As memory access in conventional CGRAs is a performance bottleneck, we address this issue by featuring each PE with an independent SRAM bank and memory controller. Through the switch box, each PE can connect to one selected SRAM bank based on the data placement and task assignment. Fig. 9(c) shows the details of the memory controller in a PE. When memory access units in a PE issue memory access requests, the access scheduler first dispatches these requests to a FIFO queue, then the memory address is generated by the address calculator. Finally, the memory address as well as the command are issued to the selected SRAM bank.

The dataflow of IoT applications is comprised of multiple data processing tasks, especially in IoT gateways. It is natural to execute these tasks in a pipelined manner to maximize resource usage and throughput. As shown in Fig. 10, multiple PEs can be grouped into a PE group to execute a specific data processing task. IoT data blocks are processed in a task pipeline. The main challenge of pipeline execution is: how to divide PEs into PE groups to make tasks have the same execution time. We discuss how to address this challenge in the following.
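The reference-current scheme of Section 4.2.1 can be mimicked in software to check the resulting truth tables. This is a behavioral sketch only; the current values are hypothetical and merely preserve the ordering I_AP < I_P from Fig. 8(b).

```python
# Behavioral model of the two-row sensing scheme; a stored '1' is the
# low-resistance R_P state, so it contributes the larger current I_P.
I_AP, I_P = 1.0, 3.0                        # hypothetical cell currents

I_REF_OR = (2 * I_AP + (I_AP + I_P)) / 2    # between 2*I_AP and I_AP + I_P
I_REF_AND = ((I_AP + I_P) + 2 * I_P) / 2    # between I_AP + I_P and 2*I_P

def cell_current(bit):
    return I_P if bit else I_AP

def sense_or(a, b):                         # SA1 compared against I_ref_or
    return int(cell_current(a) + cell_current(b) > I_REF_OR)

def sense_and(a, b):                        # SA2 compared against I_ref_and
    return int(cell_current(a) + cell_current(b) > I_REF_AND)

def sense_xor(a, b):                        # A xor B = (A + B) . not(A . B)
    return sense_or(a, b) & (1 - sense_and(a, b))
```

Sweeping all four input pairs reproduces the OR, AND, and XOR truth tables, which illustrates why any reference current strictly between the two aggregate current levels works.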

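Eqs. (3)–(7) compose into a single write-voltage rule. A numerical sketch follows; every device constant here (the lumped precession term C·ln(π/2θ), the K·W product, R_MTJ, V_C0, and so on) is invented purely for illustration, chosen only to give plausible magnitudes.

```python
import math

# Hypothetical device constants for illustration only.
I_C0 = 40e-6          # switching threshold current (A)
C_PRE = 6e-13         # lumped C * ln(pi / (2*theta)) term of Eq. (3) (A*s)
K_W = 1e-4            # K * W product of the access transistor (A/V)
V_GS, V_TH = 1.0, 0.35
LAMBDA_DIBL = 0.15    # DIBL coefficient
R_MTJ = 3e3           # MTJ resistance (ohm)
V_C0 = 0.6            # minimum supply voltage delivering I_C0

def write_voltage(T_th):
    """Scaled write voltage V_DD under write-delay constraint T_th (Eqs. (3)-(7))."""
    I_C = I_C0 + C_PRE / T_th                                  # Eq. (3)
    V_DS = (I_C - K_W * (V_GS - V_TH)) / (K_W * LAMBDA_DIBL)   # invert Eq. (4)
    V_req = V_DS + I_C * R_MTJ                                 # Eq. (5)
    V_opt = V_C0 * (1 + math.sqrt(1 - V_TH / V_C0))            # Eq. (6)
    return max(V_req, V_opt)                                   # Eq. (7)
```

With these toy constants, a tight 10 ns constraint makes the delay-driven V_req dominate, while a relaxed 1 µs constraint lets V_req fall below V_opt, so the energy-optimal voltage is applied instead.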

Fig. 9. CGRA-PIM in IoT gateways. (a) Architecture overview. (b) FU in PEs. (c) Memory controller in PEs.

Fig. 11. Memristor crossbar. (a) Vector-matrix multiplication. (b) Inputs and synaptic
Fig. 10. Example of pipelined data processing tasks. weights composing scheme.

Gateway tasks usually exhibit massive data-level parallelism (DLP), thus each PE can be treated as an SIMD core and the communication cost between PEs can be ignored. We assume the task execution time is inversely proportional to the number of PEs assigned to that task. The execution time of task i on a data block can then be modeled as:

    T_i(p_i) = A_i / p_i + C_i,    (8)

where p_i denotes the number of PEs assigned to task i, and A_i and C_i are generalized parameters that may vary across tasks.

To make each task have the same execution time, we require:

    ∑_{i=1}^{N} p_i = P;
    T_1(p_1) = σ_1·T_2(p_2) = σ_1σ_2·T_3(p_3) = ⋯ = (∏_{i=1}^{N−1} σ_i)·T_N(p_N).    (9)

By solving the above equations, we can determine the number of PEs assigned to each task such that every task has the same execution time.

In addition to pipeline execution, all PEs can be assigned to a single task for serial execution of IoT tasks, where the next task is scheduled only after the current one finishes. For DLP tasks, this mode can reduce execution time by utilizing all PEs. However, not all PEs are needed for simpler tasks, and the power budget of an IoT gateway may not support activating all PEs at once. Determining the number of PEs to activate for IoT gateway tasks is therefore essential. We treat the number of activated PEs and the execution mode as tunable parameters. A control method for determining the number of active PEs and selecting the execution mode is discussed in Section 5.

4.4. Approx-PIM for edge computing tier

4.4.1. CNN and memristor crossbar

In a memristor crossbar, as shown in Fig. 11(a), the conductivity of each memristor cell can be programmed to represent a synaptic weight, while the input voltages encode the input neurons. Thus, the analog vector-matrix multiplication used in CNNs can be performed as:

    i_j^out = ∑_{k=0}^{N} g_{j,k} · v_k^in,    (10)

where v^in is the input voltage vector, i^out is the output current vector, and g_{j,k} is the conductivity matrix of the memristors.

Fig. 12. Architecture overview of Approx-PIM.

4.4.2. Approx-PIM overview

Fig. 12 illustrates the overview of the proposed Approx-PIM, which consists of a number of memristor crossbar based PEs. The PEs are arranged in a centralized mesh (c-mesh) manner. A routing controller controls the communication among PEs, and a precision mode configuration module configures the PEs to support the selected precision mode. Each PE contains four memristor crossbars to perform analog multiply-add operations, and each crossbar contains 128 rows and 128 columns. We also add a memristor-based ReRAM buffer to store the input values, weights, and intermediate results. A digital rectified linear unit (ReLU) and a pooling unit, which can be realized by an FPGA or digital circuits, implement the digital activation function and the pooling algorithm, respectively. The proposed Approx-PIM is used only for inference tasks: the well-trained weights are loaded into the memristor cells using the program drivers associated with each crossbar.

As shown in Fig. 12, each memristor crossbar contains an ADC and DAC array, a program driver, the proposed output-input driver (O-I driver), and the proposed analog ReLU (A-ReLU). In 4-bit precision mode, ADCs and DACs are used only in pooling layers; the proposed output-input driver and analog ReLU eliminate AD/DA conversion between convolution layers. To support 8-bit precision mode, one 8-bit input is arranged into two crossbars, so a shift-add unit between two adjacent crossbars is used to generate the 8-bit result. Since the output voltage of one crossbar may serve as the input voltage of another crossbar, a signal router is required to configure the connections between crossbars.

4.4.3. Input and synapse composing

The proposed Approx-PIM supports both 4-bit and 8-bit precision. Although 4-bit fixed-point input and synaptic weight precision is enough for CNN-based image classification in IoT systems [29], we use the 8-bit precision for CNN applications that require high accuracy. To represent 8-bit fixed-point inputs and synaptic weights, we use two 4-bit inputs and two 4-bit synaptic weights, respectively. Generally, an 8-bit input I^k and weight W^k can be expressed as:

    I^k = I_l^k + 2^4·I_h^k,    W^k = W_l^k + 2^4·W_h^k,    (11)

where I_l^k and I_h^k denote the low and high 4 bits of I^k, and W_l^k and W_h^k denote the low and high 4 bits of W^k. As shown in Fig. 11(b), the low-bit and high-bit parts of I^k serve as the inputs of two rows in two different crossbars, while the low-bit and high-bit parts of W^k are stored in two adjacent memristor cells. The multiplication of an input and a synaptic weight therefore consists of four components, I_l^k·W_l^k, I_l^k·W_h^k, I_h^k·W_l^k, and I_h^k·W_h^k:

    R^k = I^k × W^k = I_l^k·W_l^k + 2^4·I_l^k·W_h^k + 2^4·I_h^k·W_l^k + 2^8·I_h^k·W_h^k.    (12)

For a memristor crossbar containing 2^N rows (i.e., 2^N inputs), the accumulation of the products of inputs and weights can be expressed as:

    R = ∑_{k=1}^{2^N} I^k × W^k
      = ∑_{k=1}^{2^N} I_l^k·W_l^k + 2^4·∑_{k=1}^{2^N} I_l^k·W_h^k + 2^4·∑_{k=1}^{2^N} I_h^k·W_l^k + 2^8·∑_{k=1}^{2^N} I_h^k·W_h^k.    (13)

As shown in Eq. (13), the accumulated result can be obtained by shifting and adding.

Fig. 13. The output current is converted to input voltage. (a) Output-Input driver. (b) Analog ReLU.

4.4.4. Output-input driver

To eliminate the AD/DA conversion between two adjacent layers, we propose a novel analog Output-Input driver that converts the output current to the input voltage. As shown in Fig. 13(a), a current-controlled current source first generates a current of the same strength as the output current from the bitline, and an op-amp then converts this current to an output voltage. Since we use 4-bit inputs and 4-bit synaptic weights, for a crossbar array containing 2^N rows the resolution of the output is (4 + 4 + N) bits. As we target 4-bit outputs, we shift the output to the right by (4 + N) bits. In the analog signal, the shift operation is achieved by scaling the signal strength by a factor: in Fig. 13(a), V_out is proportional to the output current I_c, i.e., V_out = R_f × I_c, so the shift can be realized by tuning R_f to a proper value. In 4-bit precision mode, an analog activation function is applied to the output voltage. If the next layer is a convolution layer, the transferred voltage V_in is directly used as the input voltage of the next layer. If the next layer is a pooling layer, a 4-bit ADC converts V_out to a 4-bit fixed-point number, which is stored in the ReRAM buffer and later used by the pooling unit. In 8-bit precision mode, V_out is directly converted to a 4-bit fixed-point number. Since an 8-bit input and synaptic weight are represented by two 4-bit inputs and synaptic weights, respectively, an 8-bit result contains four components, as shown in Eq. (13). We therefore shift and add the four components to obtain the result with 8-bit resolution, and the 8-bit result is then stored in the ReRAM buffer.

4.4.5. Analog activation function

To realize the CNN activation function in the analog domain, we propose a novel analog ReLU, an analog activation function unit that implements y = max(0, x). In 4-bit precision mode, we use a pair of memristor cells to represent a synaptic weight. The synaptic weight is expressed as the difference between the conductances of the two memristors in the positive and negative crossbar arrays:

    w_{j,k} = g_{j,k}^+ − g_{j,k}^−,    (14)

where g_{j,k}^+ and g_{j,k}^− represent the conductances of the two memristor cells. Thus, two output voltages are converted from the output currents: V_out^+ for the positive crossbar array and V_out^− for the negative crossbar array.

To realize the ReLU, we leverage a voltage comparator to compare V_out^+ and V_out^−, and an op-amp to generate the voltage difference. Fig. 13(b) shows the detailed design of the analog implementation of the ReLU. The goal of the circuit is to output (V_out^+ − V_out^−) if V_out^+ > V_out^−, and 0 otherwise. In the figure, if R3/R1 = R4/R2, then V_o = (R3/R1)·(V_out^+ − V_out^−). By setting R1 = R3, V_o equals the difference between V_out^+ and V_out^−. The comparison result of V_out^+ and V_out^− serves as the control signal of a multiplexer, which selects the input voltage of the next layer between 0 and V_o. Through this, the function y = max(0, x) is realized.

In 8-bit precision mode, since the multiply-add result is comprised of four components, an analog implementation of the activation function would be much more complex and would bring high energy and area overhead. We therefore utilize a digital ReLU as the activation function unit, which can be implemented with an 8-bit comparator or a programmable logic device such as an FPGA.

Table 2
Tunable PIM parameters in PIM-IoT.

Arch          | Bitwise-PIM           | CGRA-PIM                             | Approx-PIM
PIM parameter | STT-RAM write delay   | 1. Number of PEs; 2. Execution mode  | Precision
Symbol        | D_w, 1 ≤ D_w ≤ 9 ns   | <N_p, E_m>, 1 ≤ N_p ≤ 32,            | P_w = 1 (4-bit) or 0 (8-bit)
              |                       | E_m = 1 (pipeline) or 0 (serial)     |

4.5. Multi-tier co-optimization

In the proposed PIM-IoT, each PIM architecture has one or more tunable parameters, summarized in Table 2. These parameters lead to various power and latency behaviors. To meet given power and latency constraints, in this section we explore how to dynamically tune the PIM parameters based on the requirements.

Since IoT systems are usually powered by a battery or an energy harvester, the power supply is limited and varies with time. Let E_t^k denote the maximum available power in time period t for the kth tier. We can predict E_t^k at time period t−1 from the history of maximum available power using an ARIMA (autoregressive integrated moving average) model:

    (1 − ∑_{i=1}^{p} φ_i·L^i)(1 − L)^d·E_t^k = (1 + ∑_{i=1}^{q} θ_i·L^i)·ε_t,    (15)

where L is the lag operator, φ_i are the parameters of the autoregressive model, p is the order of the autoregressive model, d is the degree of differencing, θ_i are the parameters of the moving average model, and q is the order of the moving average model.

For a certain IoT application, the delay as well as the power consumption of each tier can be regarded as a function of the PIM parameters. Let d_1(D_w), d_2(E_m, N_p), d_3(P_w) respectively denote the latency function of the sensor tier, gateway tier, and edge node tier. Let e_1(D_w), e_2(E_m, N_p), e_3(P_w) respectively denote the power function for


each tier. The delay functions can be constructed using the IoT application model discussed in Section 3.2, and the power functions can be constructed using a regression model. Fig. 14 shows the tuning loop of PIM-IoT: we collect the task delays and the maximum available power of each tier in the (t−1)th period, predict the maximum available power of each tier for the next time period, and then the optimizer applies the optimization policy and tunes the PIM parameters for each tier. We discuss the optimization policies in the following.

Fig. 14. Parameter tuning loop in PIM-IoT.

Fig. 15. Area breakdown (left), latency and energy comparison (middle), and execution time and power consumption comparison (right).

Fig. 16. Execution time and power consumption comparison of IoT gateway tasks.

Policy 1: the IoT application execution delay is minimized under the power constraints. The objective function is:

    min( d_1(D_w) + d_2(E_m, N_p) + d_3(P_w) ).    (16)

We tune the PIM parameters at time period t−1; in the next time period t, the IoT application runs with the new PIM parameters, so the power consumption of each tier must be subject to:

    e_1(D_w) ≤ E_t^1,    e_2(E_m, N_p) ≤ E_t^2,    e_3(P_w) ≤ E_t^3.    (17)

If the total power consumption of the IoT application is also restricted to Power_max, the following constraint is applied:

    e_1(D_w) + e_2(E_m, N_p) + e_3(P_w) ≤ Power_max.    (18)

Policy 2: the total power consumption is minimized under a maximum allowed application latency Delay_max. The objective function is:

    min( e_1(D_w) + e_2(E_m, N_p) + e_3(P_w) ),    (19)

subject to:

    d_1(D_w) + d_2(E_m, N_p) + d_3(P_w) ≤ Delay_max,
    e_1(D_w) ≤ E_t^1,    e_2(E_m, N_p) ≤ E_t^2,    e_3(P_w) ≤ E_t^3.    (20)

Both policies are linear optimization problems, and the solution time is relatively short due to the limited number of tiers and parameters.

5. Evaluation

In this section, we evaluate the effectiveness of the proposed PIM-IoT. We first introduce the experimental setup and benchmarks, and then report the detailed evaluation results with discussions.

5.1. Experimental setup

Table 3 summarizes our experimental setups. To analyze the power consumption and area cost of the proposed Bitwise-PIM, SPICE [30] is adopted for circuit-level modeling and NVSim [31] for memory modeling. Besides, we use a modified Gem5 [32] to estimate the execution time and power consumption.

CGRA-PIM is used as a co-processor in the baseline platform and is assumed to operate at 600 MHz. We simulate the PEs of the CGRA at register-transfer level using Verilog HDL and synthesize them using the Synopsys Design Compiler with a TSMC 65 nm standard-cell general-purpose CMOS library. For system-level simulation, the CPU, memory controller, and memory subsystems of Gem5 are modified to include a CGRA co-processor with parallel DRAM bank access.

The digital ReLU, pooling unit, and shift-and-add unit are designed using Verilog HDL and synthesized using the Synopsys Design Compiler with a 65 nm TSMC CMOS library. The power and area of the ADCs are extracted based on an ADC circuit survey [33], while the DAC power and area are estimated based on a 4-bit DAC [34]. We also model the memristor crossbar using a modified NVSim with the device parameters reported in [35]. A trace-based simulator is designed to evaluate the execution time and power consumption of the CNN benchmarks.

5.2. Benchmarks

Table 4 lists the benchmarks adopted for evaluating PIM-IoT. We use micro benchmarks to evaluate each IoT tier separately. Data encryption and compression are utilized to evaluate the proposed Bitwise-PIM in IoT sensor nodes. Workloads that contain multiple data-dependent tasks are adopted to evaluate the CGRA-PIM in the IoT gateway. Two widely used CNNs, AlexNet and VGG-A, are adopted to evaluate the proposed Approx-PIM in the IoT edge computing node, with ImageNet as the workload for these two CNNs. Two IoT applications (described in Section 3.2) are adopted to evaluate the effectiveness of the whole IoT system.

5.3. Results and discussions

5.3.1. Micro benchmarks

Fig. 15(left) illustrates the area breakdown. To support in-memory bitwise operations, modifications to the decoder, write driver, and sensing circuitry are required; most of the effort lies in the sensing circuitry, which brings approximately 12.1% overhead.

Fig. 15(middle) shows the latency and energy consumption comparison. For bitwise operations, the latency and energy consumption of the baseline are dominated by loads/stores; as shown, in-memory bitwise operations are around 10-19x faster than the baseline while consuming around 7x less energy. Since the XOR operation needs one extra sense amplifier and multiplexer in Bitwise-PIM, it exhibits higher latency and energy consumption than the INV/OR/AND operations. With voltage scaling, the energy consumption of Bitwise-PIM can be further reduced. Fig. 15(right) compares the program execution time and power consumption of the baseline and Bitwise-PIM, which reflects the overall effectiveness of Bitwise-PIM. As shown, due to the low leakage power of STT-RAM, the power consumption is reduced by around 2.3x with around 1.3x performance improvement. In Fig. 15(middle and right), when the MTJ switching threshold time increases from 2 ns to 6 ns, more power reduction can be achieved at the price of a slight performance degradation. Therefore, for time-critical IoT applications, a lower MTJ switching threshold time


Table 3
Experimental setups.

Sensor node (Bitwise-PIM):
  Baseline platform: Cortex-M0+, up to 133 MHz, 264 KB SRAM
  Circuit-level simulation: SPICE & NVSim
  System-level simulation: Gem5
  Bitwise-PIM configurations: RD latency: 0.75 ns, leakage: 4.1 mW, guaranteed WR latency: 2 ns/6 ns

Gateway (CGRA-PIM):
  Baseline platform: Raspberry Pi 5, quad 2.4 GHz Cortex-A76 cores
  Circuit-level simulation: Verilog HDL
  System-level simulation: Gem5
  CGRA-PIM configurations: number of PEs: 8/24/32, execution mode: pipeline/serial

Edge node (Approx-PIM):
  Baseline platform: Nvidia Jetson Orin Nano, 512 CUDA cores
  Circuit-level simulation: Verilog HDL & SPICE & NVSim
  System-level simulation: trace-based simulator, similar to [14]
  Approx-PIM configurations: number of PEs: 8 × 8, RRAM buffer: 2 MB per PE, crossbar size: 128 × 128, latency: 4 ns, precision: 4-bit or 8-bit fixed-point

Table 4
The evaluated benchmarks.

Micro benchmarks:
  Sensor node:
    SHA: Secure hash algorithm
    LZ4: Lossless data compression algorithm
  Gateway:
    JPEG: Image encoding (4 tasks)
    DWT-F: DWT based image fusion (3 tasks)
    Custom1: XML parsing → Bloom filter → Accumulator → Average (4 tasks)
    Custom2: CsvToSenML → SenML parsing → Range filter → Bloom filter → Interpolation → Join (6 tasks)
  Edge computing node:
    AlexNet [36]: 5 CONV layers + 3 FC layers
    VGG-A [37]: 8 CONV layers + 3 FC layers
    PV [38]: 5 CONV layers + 3 FC layers
    HG [39]: 2 CONV layers + 2 FC layers

IoT application benchmarks:
  Application 1: Automatic wildlife image classification
    Sensors: RGB and IR cameras, image size: 800 × 600, 10 images/sec
    Gateway: image fusion and encoding, connected with 2 RGB and IR cameras
    Edge node: AlexNet-based image classification, connected with 2 gateways
  Application 2: Wearable healthcare monitoring
    Sensors: 64-channel EEG and 2-channel EOG at 256 Hz, LZW compression
    Gateway: bandpass filtering and segmenting [40]
    Edge node: EEGNet [41] based EEG classification

Table 5
Area and power parameters of CGRA-PIM.

Unit       | Memory unit | Function unit | Memory controller | Switch box | SRAM bank | Interconnection
Area (mm²) | 0.049       | 0.035         | 0.09              | 0.089      | 0.16      | 0.054
Power (mW) | 3.89        | 2.46          | 3.1               | 1.01       | 2.71      | 5.11
Total: 0.4 mm² and 25 mW per PE; 18.00 mm² and 913.32 mW in total.

Table 6
Area and power parameters of Approx-PIM.

Unit       | OI driver | A-ReLU | ADC  | DAC  | Crossbar | Router | ReRAM
Area (µm²) | 3         | 2      | 6000 | 30   | 150      | 64 000 | 3 × 10⁵
Power (mW) | 0.01      | 0.007  | 1.8  | 0.48 | 2.7      | 0.13   | 6.6
Unit       | Dig. P. + ReLU | Shift + Add | Program driver | C-mesh
Area (mm²) | 0.001          | 0.0004      | 0.00013        | 3.2
Power (mW) | 1.28           | 2.8         | 0.45           | 3092
Total: 0.66 mm² and 132.1 mW per PE; 45.4 mm² and 11.5 W in total.

can be adopted to reduce the latency; otherwise, a higher threshold time can be used to save more power.

Table 5 shows the estimated area and power consumption of the major components in CGRA-PIM. Note that we report the maximum power consumption; the actual power is much lower and depends on the workload. Each PE introduces 0.4 mm² area overhead with a maximum power consumption of 25 mW.

Fig. 16 shows the normalized task execution time and power consumption under different PIM parameters. As shown, CGRA-PIM can largely reduce the task execution time. For JPEG, DWT-F, and Custom2, the pipeline execution mode is also better than the serial execution mode, mainly because these tasks are more likely to have a balanced pipeline partition. As the number of PEs increases, the execution time is reduced. However, the power consumption is


proportional to the number of PEs: more PEs lead to higher power consumption. Different PIM parameters thus lead to various latency and power consumption behaviors, so it is important to tune the PIM parameters based on the power and latency constraints.

Fig. 17. Speedup of CNNs over baseline.

Fig. 18. Power reduction of CNNs over baseline.

Fig. 19. Multiplication-and-accumulation (MAC) throughput (left) and energy efficiency (right) comparison between ResiRCA and Approx-PIM.

Fig. 20. Normalized latency (left) and power consumption (right) breakdown of IoT applications. T: STT-RAM write delay; N: number of PEs; L: pipeline mode; S: serial mode; P: precision mode. For example, T6-N8L-P4 represents a configuration with 6 ns STT-RAM write delay, 8 CGRA PEs in pipeline mode, and 4-bit precision mode.

Table 6 shows the estimated area and power parameters of Approx-PIM. We assume that 8 inputs share one DAC and 8 outputs share one ADC. Even so, the ADCs and DACs consume around 44% of the total PE area and around 82% of the total PE power. Compared with the DACs/ADCs, the power and area of the OI-driver and A-ReLU are negligible. Therefore, with the proposed OI-driver and A-ReLU, the power consumption of convolution layers without pooling operations can be significantly reduced.

We manually map the benchmark CNNs to the PEs in Approx-PIM. Since the Nvidia Jetson Orin Nano natively supports INT8 operations, we use INT8 for the benchmark CNNs on the baseline platform. While quantized 4-bit CNNs can still run on the Jetson Orin Nano, any 4-bit operation is emulated in software, which is generally less efficient than native hardware acceleration. Fig. 17 shows the speedups of CNNs on Approx-PIM over the baseline platform: the average speedups in 4-bit and 8-bit precision modes are 32.7x and 21.5x, respectively. Generally, the 4-bit precision mode achieves higher speedup than the 8-bit precision mode. Fig. 18 shows the power reduction of Approx-PIM; as shown, reducing the data width can largely reduce the power consumption. Compared to the baseline platform, Approx-PIM achieves 108.8x and 43.5x power reduction in 4-bit and 8-bit precision modes, respectively.

In 4-bit precision mode, the AD/DA conversion is eliminated for layers without pooling by using the proposed analog ReLU and Output-Input driver, so these layers (e.g., CONV3 for both AlexNet and VGG-A) achieve higher speedup and more power saving. With a higher data width (i.e., 8-bit fixed point), most CNN applications have only a slight accuracy loss compared to full precision (i.e., 16-bit/32-bit floating point) [14,16]. However, since the 8-bit precision mode requires more crossbars and ADCs/DACs, it exhibits higher power consumption than the 4-bit precision mode.

We also compared the proposed Approx-PIM to ResiRCA [42], a configurable resistive random-access memory (ReRAM) crossbar-based CNN accelerator for energy-harvesting IoT devices. To ensure a fair comparison, we configure ResiRCA to leverage the same memristor crossbar resources as Approx-PIM; specifically, we utilize 64 crossbars with a 128 × 128 configuration in the comparative evaluation. We also assume that there is sufficient harvested energy to activate all crossbars in ResiRCA, allowing it to support more complex networks such as AlexNet. Furthermore, since ResiRCA utilizes 4-bit quantization for both input and weight representations, we configure Approx-PIM to operate in 4-bit precision mode to ensure a consistent comparison. In the comparative evaluation, we examine the performance of ResiRCA under the Naive1 and Naive2 execution strategies. Fig. 19 shows the multiplication-and-accumulation (MAC) throughput and energy efficiency comparison between ResiRCA and Approx-PIM. As shown, Approx-PIM outperforms ResiRCA-Naive1 in both throughput and energy efficiency. With weight duplication, ResiRCA-Naive2 achieves the best performance as well as energy efficiency for small CNNs (i.e., PV and HG). However, when running large CNNs such as AlexNet, the weight duplication in ResiRCA no longer works, as the crossbars cannot accommodate all the weights. Additionally, the lower operating frequency (200 MHz compared to 600 MHz) contributes to ResiRCA being slower than Approx-PIM. Furthermore, the analog ReLU implementation in Approx-PIM eliminates the need for AD/DA conversions between adjacent convolution layers, enabling Approx-PIM to achieve better energy efficiency than ResiRCA.

5.3.2. IoT applications

We use two IoT applications (as shown in Table 4) to evaluate the effectiveness of PIM-IoT under a set of PIM parameter combinations. Fig. 20 shows the normalized latency and power consumption breakdown. Note that for remote image classification, the camera sensors are connected to the IoT gateway directly, so the latency and power consumption breakdown only contains the time and power consumed by the IoT gateway and the edge computing node. Overall, PIM-IoT achieves a 5.7x performance improvement and a 6x power consumption reduction on average. In the figure, the PIM parameter setting 'T2-N32L-P4' achieves the highest performance, while 'T6-N8L-P8' achieves the most power saving. As shown, both the latency and the power consumption are dominated by the IoT gateway and the edge computing node. For CGRA-PIM, increasing the number of PEs reduces the latency but brings more power consumption. For Approx-PIM, changing the precision mode from 8-bit to 4-bit achieves more power saving together with more performance improvement.
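The configuration labels used in Fig. 20 (e.g., 'T6-N8L-P4') pack the three tiers' PIM parameters into one string. A small helper such as the following (a hypothetical utility, not part of the paper's toolchain) makes the encoding explicit.

```python
import re

def parse_config(label):
    """Decode a 'T<write_delay>-N<pes><L|S>-P<bits>' label, e.g. 'T6-N8L-P4':
    STT-RAM write delay (ns), CGRA PE count plus execution mode, precision."""
    m = re.fullmatch(r"T(\d+)-N(\d+)([LS])-P(\d+)", label)
    if not m:
        raise ValueError(f"bad config label: {label}")
    return {
        "stt_write_delay_ns": int(m.group(1)),
        "cgra_pes": int(m.group(2)),
        "cgra_mode": "pipeline" if m.group(3) == "L" else "serial",
        "precision_bits": int(m.group(4)),
    }

cfg = parse_config("T6-N8L-P4")
assert cfg == {"stt_write_delay_ns": 6, "cgra_pes": 8,
               "cgra_mode": "pipeline", "precision_bits": 4}
```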

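To make the multi-tier tuning loop of Section 4.5 concrete, the sketch below forecasts each tier's power budget with a damped-trend predictor (a crude stand-in for the ARIMA model of Eq. (15)) and then brute-forces the small parameter space of Table 2 under Policy 1 (Eqs. (16)-(18)). The per-tier delay and power models are invented stand-ins, not the paper's fitted functions.

```python
import itertools

def predict_budget(history, phi=0.5):
    """One-step forecast: last value plus a damped first difference."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + phi * (history[-1] - history[-2])

# Invented per-tier delay (ms) and power (mW) models over the Table 2 space.
def d1(dw): return 0.5 + 0.05 * dw            # sensor tier latency
def e1(dw): return 6.0 - 0.4 * dw             # slower writes -> less power
def d2(em, n): return (40.0 if em else 55.0) / n + 1.0
def e2(em, n): return 25.0 * n + (50.0 if em else 30.0)
def d3(pw): return 2.0 if pw == 4 else 3.5
def e3(pw): return 90.0 if pw == 4 else 220.0

def policy1(budgets, power_max):
    """Minimize total delay s.t. per-tier (Eq. (17)) and total (Eq. (18))
    power limits, by enumerating the small discrete parameter space."""
    b1, b2, b3 = budgets
    best = None
    for dw, em, n, pw in itertools.product(range(1, 10), (0, 1),
                                           (8, 24, 32), (4, 8)):
        if e1(dw) > b1 or e2(em, n) > b2 or e3(pw) > b3:
            continue                           # per-tier budget violated
        if e1(dw) + e2(em, n) + e3(pw) > power_max:
            continue                           # total budget violated
        delay = d1(dw) + d2(em, n) + d3(pw)
        if best is None or delay < best[0]:
            best = (delay, (dw, em, n, pw))
    return best

budgets = tuple(predict_budget(h) for h in ([5, 5.5], [900, 880], [240, 230]))
best = policy1(budgets, power_max=1100)
```

Because the parameter space is tiny (9 × 2 × 3 × 2 combinations here), exhaustive enumeration is a reasonable substitute for a linear-programming solver, echoing the paper's observation that the solution time is short.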

Fig. 21. Power and latency comparison between different control policies.

To show the effectiveness of the proposed multi-tier co-optimization method, we evaluate the power consumption and latency of the two IoT applications under different optimization policies. For comparison, we add two single-tier naive optimization policies: naive-max-perf and naive-min-power. Naive-max-perf tries to maximize the performance of each tier without considering the total power constraint, while naive-min-power minimizes the power consumption of each tier without considering the latency constraints. As shown in Fig. 21, although naive-max-perf achieves the lowest latency, its total power consumption violates the power constraint. Similarly, naive-min-power exhibits the lowest power consumption, but its latency violates the latency constraint. With multi-tier co-optimization, the PIM parameters can be tuned to achieve the globally optimal performance or power under the given constraints.

6. Related work

In IoT systems, bitwise operations are essential in data compression and encryption, leading to the development of various bitwise PIM architectures. Pinatubo [18] investigated in-memory bitwise operations using phase change memory (PCM). PIMDB [43] introduced a bulk-bitwise PIM architecture for relational databases. Ambit [44] and SIMDRAM [45] facilitate in-memory bitwise operations with commodity DRAM, while Racer [46] utilizes ReRAM for bit-pipelined processing. Perach et al. [17] focused on ensuring consistency and coherency in bulk-bitwise PIM architectures. In contrast, Bitwise-PIM targets low-power devices and incorporates performance-guaranteed voltage scaling to reduce energy consumption.

For energy-efficient IoT data processing, Brandalero et al. [47] proposed a CGRA accelerator for single-ISA heterogeneous systems to accelerate data-intensive IoT tasks, while Lee et al. [48] and Burger et al. [49] suggested enhancing multi-core processors with FPGAs for similar applications. However, all these approaches neglect the data movement overhead in data-intensive tasks. To enhance neural networks, which play a key role in intelligent IoT edge devices, several memristor crossbar PIM architectures have been proposed. PRIME [14], ISAAC [15], and PipeLayer [16] are designed for high-performance CNN training or inference. Yuan et al. [50] explored weight pruning's impact on model fault tolerance in ReRAM-based DNN accelerators. ResiRCA [42] is tailored for ultra-low power energy-harvesting IoT devices, optimizing CNN execution costs according to fluctuating power availability. Similarly, Yang et al. [51] introduced a multi-level operation mode for implementing spiking neural networks (SNNs) using ReRAM crossbars, enabling computational adaptation to varying environmental energy. ConvFIFO [52] utilizes a first-in-first-out (FIFO) dataflow mechanism to enhance CNN performance in non-volatile memory (NVM)-based PIM accelerators. In contrast to these approaches, Approx-PIM employs two precision modes and an analog ReLU to minimize the crossbar resources and energy required by CNNs. Besides, we also consider the co-optimization of multi-tier PIM architectures to meet the power consumption and latency requirements of IoT applications.

7. Conclusion

This paper proposes PIM-IoT, a three-tier PIM architecture designed to handle diverse sensor data and inference tasks while enhancing energy efficiency and performance in IoT applications. PIM-IoT explores hierarchical, heterogeneous, and adaptive PIM architectures for each IoT tier and employs a co-optimization mechanism to dynamically tune PIM parameters, meeting the power consumption and latency requirements of IoT applications. Experimental results demonstrate that PIM-IoT effectively executes various data processing tasks, achieving a 5.6x performance improvement and a 6x reduction in power consumption.

Although PIM shows great promise in IoT systems for reducing energy consumption and enhancing performance, integrating PIM architectures into existing systems remains challenging. Our current PIM designs may also struggle with larger datasets or more complex tasks as IoT applications grow in scale and complexity. To tackle these issues, our future research will focus on two key areas: (1) compatibility and integration, ensuring PIM accelerators work seamlessly with current hardware and software ecosystems for real-world deployment; and (2) dynamic resource allocation, developing mechanisms that allow PIM to scale dynamically with workloads, enabling more efficient and flexible resource utilization.

CRediT authorship contribution statement

Kan Zhong: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Qiao Li: Writing – review & editing, Writing – original draft, Methodology, Investigation. Ao Ren: Validation, Methodology. Yujuan Tan: Writing – review & editing, Software. Xianzhang Chen: Writing – review & editing, Supervision. Linbo Long: Writing – review & editing, Resources. Duo Liu: Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to express our sincere gratitude to all the reviewers for their valuable feedback and constructive comments. Their insights and suggestions have greatly improved the quality of this work. The work described in this paper is supported by the National Natural Science Foundation of China (Grant Nos. 62402070 and 62202396).

Data availability

Data will be made available on request.

References

[1] A. Alwarafy, K.A. Al-Thelaya, M. Abdallah, J. Schneider, M. Hamdi, A survey on security and privacy issues in edge-computing-assisted internet of things, IEEE Internet Things J. 8 (6) (2021) 4004–4022.
[2] C. Butpheng, K.-H. Yeh, H. Xiong, Security and privacy in IoT-cloud-based e-health systems—A comprehensive review, Symmetry 12 (7) (2020).
[3] B. Zhang, N. Mor, J. Kolb, D.S. Chan, N. Goyal, K. Lutz, E. Allman, J. Wawrzynek, E. Lee, J. Kubiatowicz, The cloud is not enough: Saving IoT from the cloud, in: Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing, HotCloud, 2015, pp. 21–27.


[4] D. Schwartz, J.M.G. Selman, P. Wrege, A. Paepcke, Deployment of embedded edge-AI for wildlife monitoring in remote regions, in: 2021 20th IEEE International Conference on Machine Learning and Applications, ICMLA'21, 2021, pp. 1035–1042.
[5] C. Li, Y. Hu, L. Liu, J. Gu, M. Song, X. Liang, J. Yuan, T. Li, Towards sustainable in-situ server systems in the big data era, in: Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture, ISCA, 2015, pp. 14–26.
[6] S. Hamdan, M. Ayyash, S. Almajali, Edge-computing architectures for internet of things applications: A survey, Sensors 20 (22) (2020).
[7] L. Kong, J. Tan, J. Huang, G. Chen, S. Wang, X. Jin, P. Zeng, M. Khan, S.K. Das, Edge-computing-driven internet of things: A survey, ACM Comput. Surv. 55 (8) (2022).
[8] H. Guo, Z. Zhou, D. Zhao, W. Gaaloul, EGNN: Energy-efficient anomaly detection for IoT multivariate time series data using graph neural network, Future Gener. Comput. Syst. 151 (2024) 45–56.
[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: Efficient inference engine on compressed deep neural network, in: Proceedings of the 43rd International Symposium on Computer Architecture, ISCA, 2016, pp. 243–254.
[10] Q. He, Z. Dong, F. Chen, S. Deng, W. Liang, Y. Yang, Pyramid: Enabling hierarchical neural networks with edge computing, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 1860–1870.
[11] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, et al., Google workloads for consumer devices: Mitigating data movement bottlenecks, in: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'18, 2018, pp. 316–331.
[12] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel computing, IEEE Micro 31 (5) (2011) 7–17.
[13] S. Ghose, A. Boroumand, J.S. Kim, J. Gómez-Luna, O. Mutlu, Processing-in-memory: A workload-driven perspective, IBM J. Res. Dev. 63 (6) (2019) 3:1–3:19.
[14] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory, in: Proceedings of the 43rd International Symposium on Computer Architecture, ISCA, 2016, pp. 27–39.
[15] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. Strachan, M. Hu, R.S. Williams, V. Srikumar, ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in: Proceedings of the 43rd ACM/IEEE International Symposium on Computer Architecture, ISCA, 2016, pp. 14–26.
[16] L. Song, X. Qian, H. Li, Y. Chen, PipeLayer: A pipelined ReRAM-based accelerator for deep learning, in: Proceedings of the 23rd IEEE International Symposium on High Performance Computer Architecture, HPCA, 2017, pp. 14–26.
[17] B. Perach, R. Ronen, S. Kvatinsky, On consistency for bulk-bitwise processing-in-memory, in: Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture, HPCA'23, 2023, pp. 705–717.
[18] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories, in: 2016 53rd ACM/EDAC/IEEE Design Automation Conference, DAC, 2016, pp. 1–6.
[19] Raspberry Pi, Raspberry Pi Pico, 2024, https://www.raspberrypi.com/products/raspberry-pi-pico/. (Accessed 14 May 2024).
[20] Raspberry Pi Foundation, Raspberry Pi 5, 2022, https://www.raspberrypi.com/products/raspberry-pi-5/. (Accessed 14 May 2024).
[21] NVIDIA Corporation, NVIDIA Jetson Orin Nano, 2023, https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/. (Accessed 30 October 2024).
[22] A. Shukla, S. Chaturvedi, Y. Simmhan, RIoTBench: A real-time IoT benchmark for distributed stream processing platforms, 2017, CoRR. URL http://arxiv.org/abs/1701.08530.
[23] D. Gajaria, K. Antony Gomez, T. Adegbija, A study of STT-RAM-based in-memory computing across the memory hierarchy, in: 2022 IEEE 40th International Conference on Computer Design, ICCD'22, 2022, pp. 685–692.
[24] B. Jahannia, S.A. Ghasemi, H. Farbeh, An energy efficient multi-retention STT-MRAM memory architecture for IoT applications, IEEE Trans. Circuits Syst. II: Express Briefs 71 (3) (2024) 1431–1435.
[25] D. Gajaria, T. Adegbija, Evaluating the performance and energy of STT-RAM caches for real-world wearable workloads, Future Gener. Comput. Syst. 136 (2022) 231–240.
[26] D. Wu, P. Chen, T.K. Bandara, Z. Li, T. Mitra, Flip: Data-centric edge CGRA accelerator, ACM Trans. Des. Autom. Electron. Syst. 29 (1) (2023).
[27] A. Raychowdhury, D. Somasekhar, T. Karnik, V. De, Design space and scalability exploration of 1T-1STT MTJ memory arrays in the presence of variability and disturbances, in: Proceedings of the 2009 IEEE International Electron Devices Meeting, IEDM, 2009, pp. 1–4.
[28] Q.K. Trinh, S. Ruocco, M. Alioto, Voltage scaled STT-MRAMs towards minimum-energy write access, IEEE J. Emerg. Sel. Top. Circuits Syst. 6 (3) (2016) 305–318.
[29] C.-Y. Lu, R.-S. Tsay, W. Chang, An embedded CNN design for edge devices based on logarithmic computing, in: 2022 International Symposium on VLSI Design, Automation and Test, VLSI-DAT, 2022, pp. 1–4.
[30] L.W. Nagel, D. Pederson, SPICE (Simulation Program with Integrated Circuit Emphasis), Technical Report, EECS Department, University of California, Berkeley, 1973, URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/1973/22871.html.
[31] X. Dong, C. Xu, Y. Xie, N.P. Jouppi, NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 31 (7) (2012) 994–1007.
[32] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D.R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M.D. Hill, D.A. Wood, The Gem5 simulator, ACM SIGARCH Comput. Archit. News 39 (2) (2011) 1–7.
[33] B. Murmann, ADC Performance Survey 1997–2024, 2024, https://github.com/bmurmann/ADC-survey.
[34] A.S. Kherde, R.G. Pritesh, An efficient design of R-2R digital to analog converter with better performance parameter in (90nm) 0.09-um CMOS process, Int. J. Innov. Technol. Explor. Eng. (IJITEE) 3 (7) (2013) 2278–3075.
[35] K.-H. Kim, S. Gaba, D. Wheeler, J.M. Cruz-Albrecht, T. Hussain, N. Srinivasa, W. Lu, A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications, Nano Lett. 12 (1) (2011) 389–395.
[36] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012, pp. 1097–1105.
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, CoRR, abs/1409.1556.
[38] R. Wang, Z. Xu, A pedestrian and vehicle rapid identification model based on convolutional neural network, in: Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, ICIMCS'15, 2015.
[39] H.-I. Lin, M.-H. Hsu, W.-K. Chen, Human hand gesture recognition using a convolution neural network, in: 2014 IEEE International Conference on Automation Science and Engineering, CASE'14, 2014, pp. 1038–1043.
[40] D. Sen, B.B. Mishra, P.K. Pattnaik, A review of the filtering techniques used in EEG signal processing, in: 2023 7th International Conference on Trends in Electronics and Informatics, ICOEI'23, 2023, pp. 270–277.
[41] V.J. Lawhern, A.J. Solon, N.R. Waytowich, S.M. Gordon, C.P. Hung, B.J. Lance, EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces, J. Neural Eng. 15 (5) (2018).
[42] K. Qiu, N. Jao, M. Zhao, C.S. Mishra, G. Gudukbay, S. Jose, J. Sampson, M.T. Kandemir, V. Narayanan, ResiRCA: A resilient energy harvesting ReRAM crossbar-based accelerator for intelligent embedded processors, in: 2020 IEEE International Symposium on High Performance Computer Architecture, HPCA'20, 2020, pp. 315–327.
[43] B. Perach, R. Ronen, B. Kimelfeld, S. Kvatinsky, Understanding bulk-bitwise processing in-memory through database analytics, IEEE Trans. Emerg. Top. Comput. 12 (1) (2024) 7–22.
[44] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology, in: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO'17, 2017, pp. 273–287.
[45] N. Hajinazar, G.F. Oliveira, S. Gregorio, J.D. Ferreira, N.M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, O. Mutlu, SIMDRAM: a framework for bit-serial SIMD processing using DRAM, in: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'21, 2021, pp. 329–345.
[46] M.S.Q. Truong, E. Chen, D. Su, L. Shen, A. Glass, L.R. Carley, J.A. Bain, S. Ghose, RACER: Bit-pipelined processing using resistive memory, in: Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO'21, 2021, pp. 100–116.
[47] M. Brandalero, L. Carro, A.C.S. Beck, M. Shafique, Multi-target adaptive reconfigurable acceleration for low-power IoT processing, IEEE Trans. Comput. 70 (1) (2021) 83–98.
[48] S.K. Lee, P.N. Whatmough, M. Donato, G.G. Ko, D. Brooks, G.-Y. Wei, SMIV: A 16-nm 25-mm² SoC for IoT with Arm Cortex-A53, eFPGA, and coherent accelerators, IEEE J. Solid-State Circuits 57 (2) (2022) 639–650.
[49] A. Burger, C. Cichiwskyj, S. Schmeißer, G. Schiele, The elastic internet of things - a platform for self-integrating and self-adaptive IoT-systems with support for embedded adaptive hardware, Future Gener. Comput. Syst. 113 (2020) 607–619.
[50] G. Yuan, Z. Liao, X. Ma, Y. Cai, Z. Kong, X. Shen, J. Fu, Z. Li, C. Zhang, H. Peng, N. Liu, A. Ren, J. Wang, Y. Wang, Improving DNN fault tolerance using weight pruning and differential crossbar mapping for ReRAM-based edge AI, in: Proceedings of the 2021 22nd International Symposium on Quality Electronic Design, ISQED'21, 2021, pp. 135–141.
[51] Y. Yang, M. Li, C. Xu, K. Qiu, A resilient ReRAM crossbar-based PIM design for SNN in energy harvesting scenarios, in: Proceedings of the 12th International Workshop on Energy Harvesting and Energy-Neutral Sensing Systems, ENSsys'24, 2024, pp. 21–27.
[52] L. Zhao, Y. Qian, F. Meng, X. Xu, X. Yin, C. Zhuo, ConvFIFO: A crossbar memory PIM architecture for ConvNets featuring first-in-first-out dataflow, in: Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference, ASP-DAC'24, 2024, pp. 824–829.

Kan Zhong is currently an associate professor at the School of Big Data and Software Engineering, Chongqing University. He received the Ph.D. and B.S. degrees in computer science from the College of Computer Science, Chongqing University, China, in 2018 and 2013, respectively. He was a visiting scholar at the Department of Electrical and Computer Engineering, University of Florida, USA, from 2016 to 2017. His research interests include storage systems, memory subsystems, and computer architecture.

Qiao Li received the Ph.D. degree from the Department of Computer Science, City University of Hong Kong, in 2021. She is now an associate professor in the School of Informatics, Xiamen University. Her research interests include NAND flash memory, storage systems, and computer architecture.

Ao Ren received his B.S. degree from Dalian University of Technology, Dalian, China, in 2013, and his M.S. degree from Syracuse University, Syracuse, USA, in 2015. He obtained his Ph.D. degree from Northeastern University, Boston, USA, in 2020. He is now a Professor in the College of Computer Science at Chongqing University, Chongqing, China. His research interests include deep learning model compression and domain-specific accelerator architectures.

Yujuan Tan received her Ph.D. degree in Computer Science and Engineering from Huazhong University of Science and Technology, Wuhan, China, in 2012. She obtained her B.S. degree in Computer Science and Engineering from Hunan Normal University, Changsha, China, in 2006. She is now a Professor in the College of Computer Science at Chongqing University, Chongqing, China. Her research interests include hybrid memory systems, flash caches, and data deduplication.

Xianzhang Chen received the Ph.D. degree from the College of Computer Science at Chongqing University, China, in 2017. He obtained his B.S. and M.S. degrees in Computer Science and Engineering from Southeast University, Nanjing, China. He was a Research Fellow at the National University of Singapore from 2019 to 2020. He is currently an Associate Professor with Chongqing University. Dr. Chen was a recipient of best paper awards at IEEE NVMSA'2015 and ICCD'17, "the editor's pick of 2016" of IEEE TC, and the Chongqing Best Ph.D. Dissertation Award in 2018.

Linbo Long received the Ph.D. degree in computer science from the College of Computer Science, Chongqing University, China, in 2016, and the B.S. degree in computer science from the School of Computer Science and Technology, Chongqing University, in 2011. He is currently a professor and chair with the Department of Computer in the College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China. His current research interests include compiler optimization, emerging memory techniques, big data, and embedded systems.

Duo Liu is a tenured professor with the College of Computer Science, Chongqing University, China. He received the Ph.D. degree in computer science from The Hong Kong Polytechnic University in 2012. He received the B.E. degree in computer science from the Southwest University of Science and Technology, Sichuan, China, in 2003, and the M.E. degree from the Department of Computer Science, University of Science and Technology of China, Hefei, China, in 2006. His current research interests include emerging non-volatile memory (NVM) techniques for embedded systems, memory/storage management in mobile systems, and hardware/software co-design. He has served as a program committee member for multiple international conferences, and as a reviewer for several ACM/IEEE journals and transactions.