Applications of Emerging Memory Technology - Beyond Storage
Springer Series in Advanced Microelectronics
Volume 63
Series Editors
Kukjin Chun, Department of Electrical and Computer Engineering, Seoul National
University, Seoul, Korea (Republic of)
Kiyoo Itoh, Hitachi Ltd., Tokyo, Japan
Thomas H. Lee, Department of Electrical Engineering CIS-205, Stanford
University, Stanford, CA, USA
Rino Micheloni, Torre Sequoia, II piano, PMC-Sierra, Vimercate (MB), Italy
Takayasu Sakurai, The University of Tokyo, Tokyo, Japan
Willy M. C. Sansen, ESAT-MICAS, Katholieke Universiteit Leuven, Leuven,
Belgium
Doris Schmitt-Landsiedel, Lehrstuhl für Technische Elektronik, Technische
Universität München, Munich, Germany
The Springer Series in Advanced Microelectronics provides systematic information
on all the topics relevant for the design, processing, and manufacturing of
microelectronic devices. The books, each prepared by leading researchers or
engineers in their fields, cover the basic and advanced aspects of topics such as
wafer processing, materials, device design, device technologies, circuit design,
VLSI implementation, and sub-system technology. The series forms a bridge
between physics and engineering, therefore the volumes will appeal to practicing
engineers as well as research scientists.
Editor
Manan Suri
Department of Electrical Engineering
Indian Institute of Technology Delhi
New Delhi, Delhi, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
He who sees inaction in action, and action
in inaction
Is Spiritually wise, transcendentally situated
a perfect performer of all actions
(Shrimad Bhagwad Gita, Chapter 4, Verse 18)
In loving memory of
Harbans Kaur, Gyan Chand, Raj Rani, and
Jagdish Chander
Preface
Let me try to keep this Preface short and simple so that readers can save time for
actual technical content.
If Data is the question? Memory is the answer!
Over the last few decades, the quanta of memory, memory-related devices, and
circuits on most silicon dies have increased manifold and will further increase in the
time to come. This leads us to the question: if the presence of memory is becoming
more and more profound, why not exploit it for multiple applications beyond
simple conventional storage? The emergence of different flavors of new memory
materials and devices with diverse underlying physics has opened many new
application opportunities. The contributions in this edition are an effort in the
direction of showcasing applications beyond simple 1/0 storage that can be realized
using emerging nanoscale, non-volatile memory devices, materials, and circuits.
The present volume is a work in progress, and we hope to improve it further with
your feedback. The book in its current form may be used as a research reference
text as well as reading material for advanced courses. I would like to express deep
gratitude to all the contributing researchers and their teams for presenting excellent
technical content for this project and also the project co-ordinator for sincere efforts
in making this edition possible.
1 Towards Spintronics Nonvolatile Caches
Zhaohao Wang, Bi Wu, Chao Wang, Wang Kang and Weisheng Zhao
Abstract Non-volatile (NV) cache is desired for overcoming the power and speed
bottlenecks of the modern static random access memory (SRAM). A promising
candidate for constructing the NV cache is the spin transfer torque magnetic RAM
(STT-MRAM), which features low power, fast speed, high density and nearly
unlimited endurance. In this chapter, we review the efforts made to realize
the STT-MRAM based NV cache, ranging from the architecture level to the device
level. In addition, the application potential of emerging spintronics technologies,
such as spin orbit torque (SOT) and voltage-controlled magnetic anisotropy (VCMA),
is discussed in terms of their benefits and challenges.
1.1 Introduction
In a computing system, processors and memories are the key modules which,
respectively, perform arithmetic operations and store data/instructions. Therefore,
the computing efficiency is strongly dependent on both the execution speed of the
processor and the access speed of the memory. Unfortunately, these two speeds are
often poorly matched in a typical computing architecture. Generally, accessing the
memory takes much longer than executing instructions in the processor. As a result,
the performance of a computing system is mainly determined by the memory bandwidth
rather than the processor frequency. This issue is known as the “memory wall” in
modern computers. For instance, the base frequency of an Intel Core i7 processor
can be as high as 3.70 GHz, whereas the speed of a Samsung DDR3 dynamic random
access memory (DRAM) is 1600 Mbps. To overcome the “memory wall”, modern computers
employ the memory hierarchy shown in Fig. 1.1a, where various types of memories are
organized at different levels according to their capacity and speed. The most
frequently accessed data or instructions are copied into several high-speed
memories, which are embedded into or placed very close to the processor (e.g., the
CPU in Fig. 1.1a). These memories are called caches, and they efficiently reduce the
speed gap between the processor and the main memory. Furthermore, with the rise of
the multi-core processor, the computing efficiency is improved, while the cache
capacity needs to be increased to accommodate more data and instructions. Similar
to Fig. 1.1a, caches are also organized as a hierarchy of multiple levels including
L1, L2 and a shared L3 (last level cache, or LLC), as shown in Fig. 1.1b. The L1
cache requires an access speed as fast as possible. By contrast, in an LLC a large
capacity is desirable and a slower speed is tolerated.
Fig. 1.1 (caption partly lost) a Memory hierarchy from the volatile cache to the
main memory, ordered by speed, cost and capacity [1]. b Cache hierarchy of an 8-core
CPU: each core has a pipeline with L1-I and L1-D caches and an L2 cache, and all
cores share the L3 cache (LLC) in front of the DRAM main memory
As mentioned above, the cache needs to be faster than the main memory. This
difference can be explained by the bit-cell structures of a cache and a main
memory shown in Fig. 1.2. The cache and the main memory are constructed with the
static random access memory (SRAM) and the DRAM, respectively. The SRAM bit-cell
consists of six transistors. The 1-bit data is stored in two cross-coupled
inverters (M1–M4) and is read or written through two access transistors (M5–M6).
The DRAM bit-cell is composed of an access transistor connected to a capacitor.
The read and write operations are performed by discharging and charging the
capacitor. Since the charge on the capacitor needs to be maintained by a periodic
refresh, the DRAM is accessed more slowly than the SRAM.
Fig. 1.2 Schematic bit-cell structures of the conventional a SRAM and b DRAM
Despite its fast access, the SRAM-based cache is not a perfect memory due to the
following two issues. First, the SRAM occupies a much larger area than the DRAM
because it needs more transistors. In a modern microprocessor, the SRAM-based
caches occupy more than one half of the chip area. Moreover, the capacity of the
SRAM-based cache is very limited compared with the DRAM-based main memory. For
instance, the capacities of the cache and the main memory in a ThinkPad-X1 Carbon
laptop are 8 MB and 16 GB, respectively. Second, the SRAM is volatile and consumes
considerable energy. In particular, the leakage current of the transistors cannot
be eliminated since the power supply has to be always on to keep the data. With the
scaling of the CMOS technology, the leakage current has become the major source of
the chip power. Especially in a multi-core system, the leakage of the
large-capacity LLC accounts for most of the total power consumption. These
bottlenecks severely impede the sustainable optimization of the SRAM-based cache.
Although the embedded DRAM (eDRAM) has been proposed and used as a large-capacity
LLC, it is still difficult to reduce its power consumption.
To develop a high-performance cache beyond the SRAM, both academia and industry
are exploring nonvolatile memory (NVM) technologies, which offer the advantages of
high density and low power over the volatile SRAM. In particular, the data can be
retained in the NVM cell without a power supply, promising nearly zero leakage
power consumption. Among the various NVM technologies, Flash has been widely
commercialized for mass storage (e.g., USB flash drives and solid state drives)
[2]. However, Flash is unsuitable for a cache since it suffers from low write
endurance (∼10^5 cycles) and slow access (microseconds to milliseconds).
Phase-change RAM (PCRAM) shows higher endurance (∼10^9 cycles) than Flash, and its
storage density is sufficiently high [3]. But the write operation of the PCRAM
is achieved through heating and cooling processes, which require a latency of
hundreds of nanoseconds and cannot satisfy the demand of the cache. Instead,
the PCRAM is widely accepted as a competitive candidate for the nonvolatile main
memory. For the ferroelectric RAM (FeRAM), the nonvolatile information is
represented by the ferroelectric polarization [4]. Thus, the device has to be large
enough to store adequate charge and provide detectable signals. As a result, the
storage density of the FeRAM is much lower than that of Flash and PCRAM. In
addition, the read operation of the FeRAM is destructive. These drawbacks prevent
the FeRAM from being used as a cache. In sum, it is challenging to develop an NVM
device that meets all the requirements (density, speed, power, etc.) of the cache.
Currently, relatively promising candidates for NV-caches are the magnetic RAM
(MRAM) [5, 6] and resistive RAM (RRAM) [7]. However, limited endurance and
uncontrollable process variation are the main bottlenecks of the RRAM. By contrast,
the MRAM is more competitive as it offers almost unlimited endurance and good
compatibility with the CMOS process. In particular, several MRAM chips have shown
excellent performance when used as the LLC [8, 9].
Table 1.1 summarizes the characteristics of the abovementioned memory devices
[10].
This chapter provides an overview of the MRAM-based NV-cache. Despite the
attractive merits of the MRAM, both its write current and write latency are
currently still higher than those of the SRAM. Design optimization is therefore
indispensable for the MRAM cache. Here we present optimization strategies for the
MRAM cache and evaluate the potential of several emerging MRAMs for NV-cache
applications.
The storage element of the MRAM is the magnetic tunnel junction (MTJ) shown in
Fig. 1.3a. The core structure of an MTJ consists of two ferromagnetic layers
separated by a tunnel barrier. The two ferromagnetic layers are called the pinned
layer (or reference layer) and the free layer, respectively. The magnetization
direction of the pinned layer is fixed, whereas that of the free layer is
switchable. Once a bias voltage is applied to the MTJ, electrons pass through the
barrier by quantum tunneling and form a tunnel current. The magnitude of the tunnel
current depends on the relative magnetization orientation of the pinned layer and
the free layer. If the magnetization directions of the two ferromagnetic layers are
parallel (P) to each other, a large tunnel current is induced so that the MTJ is in
the low-resistance state. In contrast, the MTJ is in the high-resistance state if
the magnetization of the free layer is switched to be antiparallel (AP) to that of
the pinned layer (see Fig. 1.3b). The 1-bit nonvolatile data can be represented by
the MTJ resistance states.
Fig. 1.3 a Schematic structure of the magnetic tunnel junction. b Tunneling
magnetoresistance effect: the MTJ resistance can be switched between high and low
values by reversing the magnetization of the free layer
The tunneling magnetoresistance (TMR) ratio is expressed as follows:
$$\mathrm{TMR} = \frac{R_{AP} - R_P}{R_P} \times 100\% \tag{1.1}$$
where $R_{AP}$ and $R_P$ are the MTJ resistances for the AP and P states, respectively.
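As a quick numerical illustration of (1.1), the following Python sketch computes the TMR ratio from the two resistance states. The function name and the resistance values are hypothetical and chosen only for illustration; they are not taken from any device discussed in this chapter.

```python
def tmr_ratio(r_ap: float, r_p: float) -> float:
    """Return the TMR ratio in percent, following (1.1)."""
    if r_p <= 0 or r_ap < r_p:
        raise ValueError("expected 0 < R_P <= R_AP")
    return (r_ap - r_p) / r_p * 100.0


# Hypothetical resistances in ohms, for illustration only.
R_P = 5_000.0    # parallel (low-resistance) state
R_AP = 12_500.0  # antiparallel (high-resistance) state

print(f"TMR = {tmr_ratio(R_AP, R_P):.0f}%")  # prints "TMR = 150%"
```

A larger TMR ratio means a larger gap between the two resistance states, which is what the read circuitry has to distinguish.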
The write operation of the MRAM is achieved by switching the magnetization of the
free layer. In the early stage of the MRAM development (around the 2000s), the
field-induced magnetization switching (FIMS) was used for data writing. For
instance, the first commercial MRAM product launched by Freescale employed an
improved FIMS-like technology called toggle [11]. But the FIMS-based schemes are
criticized for their high power consumption and poor scalability. A purely
electrical write operation is therefore strongly desired. In this context, the
current-induced spin transfer torque (STT) was proposed for low-power and scalable
magnetization switching [12–14]. To generate the STT, an electric current flowing
through the MTJ is spin-polarized by the pinned layer. When this spin-polarized
current interacts with the free layer, a spin torque is transferred to switch the
magnetization due to the conservation of angular momentum. The direction of the
switched magnetization depends on the polarity of the applied current. So far, the
STT has been intensively studied and has become a mainstream write technology for
the MRAM. A number of commercial STT-MRAM products have been released by Everspin
[15]. Some attempts have been made to develop the STT-MRAM-based cache, which will
be discussed in the following sections.
The read operation of the MRAM is implemented by comparing the MTJ resis-
tance with a reference value. At the circuit level, this comparison is translated into
the difference of the voltage or current by a sensing amplifier [16–18].
For an MTJ, the easy axis of the ferromagnetic layers can be in-plane or
perpendicular (out-of-plane). Early MTJs were fabricated with in-plane anisotropy.
But the in-plane MTJ suffers from two drawbacks. First, the anisotropy originates
from the aspect ratio of the elliptical shape. With the scaling of the technology
node, it is increasingly difficult to maintain a sufficient anisotropy field.
Second, for the in-plane STT-MTJ, the STT has to overcome the demagnetization
field, which makes no contribution to the thermal stability barrier ($E$), as
explained by (1.2)–(1.3). Therefore, the perpendicular MTJ is preferred for the
high-performance MRAM [19].
$$E = \frac{\mu_0 M_s H_k V}{2} \tag{1.2}$$
$$I_{c\_in} \approx \frac{\alpha \gamma \mu_0 e}{\mu_B g} M_s V \left( H_k + \frac{M_s}{2} \right) \tag{1.3}$$
where $I_{c\_in}$ is the critical STT current of the in-plane anisotropy MTJ, $\mu_0$
is the vacuum permeability, $M_s$ is the saturation magnetization, $H_k$ is the
anisotropy field, $V$ is the volume of the free layer, $\alpha$ is the Gilbert
damping constant, $\gamma$ is the gyromagnetic ratio, $e$ is the electron charge,
$\mu_B$ is the Bohr magneton, and $g$ is a device-dependent factor. It is seen that
the term associated with $M_s/2$ increases the critical current but makes no
contribution to the thermal stability barrier.
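To make the role of the $M_s/2$ term more concrete, the following Python sketch evaluates (1.2) and (1.3), writing the thermal stability barrier as E. All device parameters (damping, g factor, magnetization, anisotropy field, free-layer dimensions) are hypothetical values assumed for illustration, and the helper names are not from the original text.

```python
import math

MU_0 = 4 * math.pi * 1e-7   # vacuum permeability (T*m/A)
E_CHARGE = 1.602176634e-19  # electron charge (C)
MU_B = 9.2740100783e-24     # Bohr magneton (J/T)
GAMMA = 1.76085963e11       # gyromagnetic ratio (rad/(s*T))
K_B = 1.380649e-23          # Boltzmann constant (J/K)


def barrier_energy(ms, hk, volume):
    """Thermal stability barrier E from (1.2), in joules."""
    return MU_0 * ms * hk * volume / 2


def critical_current_in_plane(alpha, g, ms, hk, volume):
    """Critical STT current of an in-plane MTJ from (1.3), in amperes."""
    prefactor = alpha * GAMMA * MU_0 * E_CHARGE / (MU_B * g)
    return prefactor * ms * volume * (hk + ms / 2)


def demag_share(ms, hk):
    """Fraction of the critical current caused by the Ms/2 term."""
    return (ms / 2) / (hk + ms / 2)


# Hypothetical in-plane free layer: 100 nm x 50 nm ellipse, 2 nm thick.
volume = math.pi / 4 * 100e-9 * 50e-9 * 2e-9  # m^3
ms, hk = 1.0e6, 4.0e4                         # A/m (assumed)
alpha, g = 0.01, 0.5                          # assumed

e_barrier = barrier_energy(ms, hk, volume)
print(f"E / (k_B T) at 300 K: {e_barrier / (K_B * 300):.1f}")
print(f"I_c_in: {critical_current_in_plane(alpha, g, ms, hk, volume) * 1e6:.1f} uA")
print(f"share of I_c_in due to Ms/2: {demag_share(ms, hk):.0%}")
```

With these assumed numbers, most of the critical current is spent on the demagnetization term, which adds nothing to E. This is the argument the text makes in favor of the perpendicular MTJ.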
Fig. 1.4 1T1J MRAM [20]: a bit-cell structure and periphery. b Cell array
Fig. 1.5 Typical architecture-level results of performance comparison between the SRAM and
STT-MRAM caches (panels include read latency (ns), area (mm^2), write latency (ns) and read
energy (nJ), plotted for cache capacities from 128K to 16M)
(iii) However, the dynamic power (especially the write power) of the STT-MRAM is
much higher than that of the SRAM.
(iv) Assuming that the total power of a cache is the sum of the leakage power and
the dynamic power, the STT-MRAM cache consumes less power than the SRAM
cache only if the saving in leakage power exceeds the increase in dynamic
power.
(v) In addition, the write latency of the STT-MRAM is much larger than that of the
SRAM, which may decrease the throughput and IPC (instructions per cycle).
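The break-even condition in point (iv) can be sketched in a few lines of Python. The power figures below are hypothetical placeholders, not measured values, and the function name is introduced only for this illustration.

```python
def mram_saves_power(sram_leak, sram_dyn, mram_leak, mram_dyn):
    """Return True if the STT-MRAM cache has the lower total power.

    Total power is modeled as leakage + dynamic power, as assumed in (iv).
    """
    leakage_saving = sram_leak - mram_leak
    dynamic_penalty = mram_dyn - sram_dyn
    return leakage_saving > dynamic_penalty


# Hypothetical power numbers in watts, for illustration only.
# Read-dominated workload: the dynamic penalty of the MRAM is moderate.
print(mram_saves_power(sram_leak=2.0, sram_dyn=0.5, mram_leak=0.2, mram_dyn=1.5))
# Write-intensive workload: the dynamic penalty outweighs the leakage saving.
print(mram_saves_power(sram_leak=2.0, sram_dyn=0.5, mram_leak=0.2, mram_dyn=3.0))
```

The first call returns True and the second returns False, which mirrors the observation that the benefit of the STT-MRAM depends on how write-intensive the workload is.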
Considering the above points, it is not recommended to use the STT-MRAM as the
L1 cache, where access operations are frequent. By contrast, the STT-MRAM may be
a preferred candidate for the L2 cache or LLC, where large capacity and low
standby power are the main concerns. In particular, sub-5 ns write latency has
recently been experimentally demonstrated with the perpendicular STT-MRAM
[9, 24–27], which is sufficient for the LLC.
From the viewpoint of applications, a number of related works demonstrated that
the STT-MRAM cache is more suitable for read-intensive applications [21–23].
Moreover, the cache miss rate can be reduced thanks to the large capacity of the
STT-MRAM. However, the performance of the STT-MRAM cache degrades as write
operations become more frequent, because the write latency and write power of the
STT-MRAM are worse than those of the SRAM, in agreement with the above analysis.
These results demonstrate that proper design strategies are indispensable for the
optimization of the STT-MRAM cache, as presented below.
write energy. Experiments show that the EWT reduces the write power by 32% in a
16 MB STT-MRAM L2 cache.
Second, the work presented in [29] deals with the so-called ‘write block’ issue in
the STT-MRAM cache. In the cache framework, the read and write operations share
the same I/O port. As a result, the relatively long write latency of the STT-MRAM
blocks the read requests, which should be executed with a higher priority. In
[29], the Obstruction-Aware Policy (OAP) was proposed to monitor the miss rate of
the L2 cache. In the L2 cache, a cache miss induces a data fetch from the off-chip
main memory, which leads to a large delay of several hundred cycles. While the
system is running, the OAP checks the miss rate periodically. If the miss rate is
larger than a threshold, write accesses bypass this cache level: instead of waiting
in the access queue, the data is written directly to the main memory. In this way,
the write operation of the STT-MRAM does not occupy the I/O port. As a result,
the read performance can be improved by 14% for an 8 MB shared STT-MRAM L2
cache.
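The periodic check and bypass decision described above can be sketched as follows. This is a simplified illustration of the idea only, not the implementation of [29]; the class name, the method names, and the threshold and period values are all assumptions.

```python
class MissRateMonitor:
    """Toy model of a periodic miss-rate check with write bypass."""

    def __init__(self, threshold: float, period: int):
        self.threshold = threshold  # miss-rate threshold (assumed value)
        self.period = period        # accesses per monitoring window (assumed)
        self.accesses = 0
        self.misses = 0
        self.bypass_writes = False

    def record_access(self, hit: bool) -> None:
        """Count one cache access and re-evaluate at the end of each window."""
        self.accesses += 1
        if not hit:
            self.misses += 1
        if self.accesses == self.period:
            miss_rate = self.misses / self.accesses
            self.bypass_writes = miss_rate > self.threshold
            self.accesses = 0
            self.misses = 0

    def write_target(self) -> str:
        """Where the next write goes under the current decision."""
        return "main_memory" if self.bypass_writes else "l2_cache"


monitor = MissRateMonitor(threshold=0.5, period=4)
for hit in (False, False, True, False):  # 75% misses in this window
    monitor.record_access(hit)
print(monitor.write_target())  # prints "main_memory"
```

In this sketch the decision is re-evaluated once per window, so a later window with a low miss rate sends writes back to the cache.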
Third, it has been experimentally demonstrated that the critical STT current for
P-to-AP switching is intrinsically larger than that for AP-to-P switching [30].
Such an asymmetry degrades the write performance of the STT-MRAM. To solve this
issue, an asymmetrical compensation technique was proposed [31]. If the number of
‘0’ bits (AP state) in a data unit is larger than the number of ‘1’ bits (P state),
the data unit is flipped so that AP-to-P switching occurs more frequently than
P-to-AP switching. The flipped data is recovered at the output port. Furthermore,
a uniform refresh mechanism was proposed in [32] to alleviate the write asymmetry
(see Fig. 1.6). Initially, all the bits in a cache line are set to the AP state
(‘L’ in Fig. 1.6). Then, the incoming data is preferentially written into this
refreshed cache line. Thereby only AP-to-P switching occurs during the write
operation. With this idea, the system achieves a 35% power reduction.
Fig. 1.7 a Schematic of a typical hybrid SRAM-MRAM cache architecture [34]. b In a hybrid
SRAM-MRAM cache, the write-intensive data is handled by the SRAM while the other data by the
MRAM
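The data-flipping step of the asymmetrical compensation technique can be illustrated with a short Python sketch. It follows the rule stated in the text (flip when ‘0’ bits outnumber ‘1’ bits) and stores one extra flag so that the data can be recovered at the output; the flag and the function names are assumptions made for this illustration, not details from [31].

```python
def encode(bits):
    """Flip the data unit if it holds more 0s (AP) than 1s (P).

    Returns the bits to be stored and a flag telling whether they were flipped.
    """
    zeros = bits.count(0)
    ones = len(bits) - zeros
    if zeros > ones:
        return [1 - b for b in bits], True
    return list(bits), False


def decode(stored, flipped):
    """Recover the original data unit at the output port."""
    return [1 - b for b in stored] if flipped else list(stored)


data = [0, 0, 0, 1, 0, 0, 1, 0]       # six 0s, two 1s
stored, flipped = encode(data)
print(stored, flipped)                 # prints "[1, 1, 1, 0, 1, 1, 0, 1] True"
assert decode(stored, flipped) == data
```

After encoding, the stored unit never contains more ‘0’ bits than ‘1’ bits, at the cost of one flag bit per data unit in this sketch.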
The above techniques aim to optimize a pure STT-MRAM cache, in which the SRAM
is absent. By contrast, some researchers accept that the write performance of the
STT-MRAM is not as good as that of the SRAM. They propose a new architecture called
the hybrid SRAM-MRAM cache (see Fig. 1.7), where the SRAM handles the frequently
accessed ‘hot data’ and the STT-MRAM stores the remaining data in a nonvolatile
way. Proper management policies are required to exploit the advantages of the
hybrid SRAM-MRAM cache [22, 33–36]. For example, it was suggested in [22] that
write-intensive workloads be executed in the SRAM instead of the MRAM to avoid the
large write cost. The data transfer between the SRAM and the MRAM causes a
migration overhead. A loop retiming technique was therefore proposed to reduce the
migration [35].
The temperature effect of the STT-MRAM has also received considerable attention.
The work presented in [37] investigated the thermal properties of a 1T1J STT-MRAM
cell. Figure 1.8 indicates that both the write energy and the write latency
decrease rapidly as the temperature increases, while the read energy and latency
vary only slightly. In addition, in [38] the peak temperature of an Intel Haswell
architecture CPU is evaluated. As shown in Fig. 1.9a, the thermal gradient can
exceed 30 °C. Combining these observations with the unique temperature properties
of the STT-MRAM cell, the authors of [38] proposed a thermal-aware cache data
replacement policy called ‘Thermosiphon’.
In the ‘Thermosiphon’ policy, two counters are introduced as shown in Fig. 1.9b.
With different counting rules, the access counter and the ratio counter together
allow the read and write access weights to be compared. Then all the cache sets in
the LLC are split into different regions based on the thermal distribution (see
Fig. 1.9c).
Fig. 1.8 Effect of the temperature on the read/write performance. a and c Read latencies for ‘0’ and
‘1’, respectively, at different temperatures. b and d Write current for AP and P states, respectively,
at different temperatures [37] (curves are plotted versus time (ns) at −25, 25, 85 and 125 °C)
Fig. 1.9 a Thermal map of eight-core Intel Haswell architecture b, c thermosiphon policy: the data
can be adaptively distributed to different temperature regions [38]
As discussed above, the STT-MRAM cell offers lower write power consumption and
shorter write delay at higher temperatures. Therefore, the system can adaptively
migrate write-intensive data between the temperature regions, depending on the
comparison of the two pre-set counter values of each cache line within the same
cache set. In this way, unlike the widely adopted Least Recently Used (LRU) data
replacement policy, ‘Thermosiphon’ migrates more write-intensive data into the
profitable (hot) region of a spatial cache set with the least sacrifice of read
performance. Compared to the conventional LRU policy which is
Fig. 1.10 ‘Sliding Basket’ policy. Data are adaptively handled depending on the FLIP possibility
[45]
that, compared with NVSim, the proposed framework is more accurate for STT-
MRAM cache simulation.
The power gating (PG) technique can be used to reduce the power consumption [50].
As shown in Fig. 1.11, PG means that the power supply is cut off when no
application has been running for a long time, resulting in zero leakage power.
Before triggering the PG in an SRAM, the data has to be moved to the lower-level
cache to avoid losing it. Once the application is restarted, the backup data needs
to be recovered from the lower-level cache, causing a large energy overhead. This
problem can be solved by using the STT-MRAM, since the PG does not cause data loss
thanks to the nonvolatility. Nevertheless, as mentioned above, the write speed and
write energy of the STT-MRAM are worse than those of the SRAM. Thus, the
zero-leakage merit of the STT-MRAM cannot be exploited if the applications run
frequently.
Fig. 1.11 a Power gating (PG) technique for the SRAM. The power supply is cut off to reduce the
leakage power if there is no running application. b For the MRAM, the leakage power is nearly
zero even if the PG technique is not applied [50]
To combine the high speed of the SRAM with the zero leakage power of the
MRAM, the hybrid NV-SRAM cell was designed with STT-MTJs [50–52].
Figure 1.12 shows a typical 6T2J NV-SRAM bit-cell [50], which can work in two
different modes. In the normal mode, this NV-SRAM operates as an SRAM. In the PG
mode, when the power supply is cut off, the data is stored in a pair of
complementary MTJs and thus the NV-SRAM operates as an MRAM. The power consumption
can be significantly reduced by making the PG time as long as possible. The fast
write operation is guaranteed since the data is written into the SRAM in the normal
mode, whereas the MTJs are only responsible for the data backup. Nevertheless, the
static power of the NV-SRAM cannot be totally eliminated as the leakage path still
exists in the normal mode (see Fig. 1.12). In addition, the NV-SRAM incurs an area
penalty due to the additional peripheral circuits required by the PG technique.
Fig. 1.13 a 2T2J cell enables the differential sensing and enlarges the sensing margin thanks to
the complementary design [53]. b 3T2J cell enables the simultaneous write operation [55]
To improve the read performance of the STT-MRAM cache, a 2T2J bit-cell structure
was proposed as shown in Fig. 1.13a [9, 53, 54]. The complementary data is
stored in two 1T1J cells and is read by a current-integral sensing scheme with
differential amplification. Compared with the conventional 1T1J bit-cell, the 2T2J
counterpart doubles the sensing margin and reduces the read latency. Based on this
design, a 1-Mb STT-MRAM test chip was fabricated. Evaluation results validate the
energy efficiency of this 2T2J STT-MRAM cache. An improved solution adopting a
similar idea is the 3T2J bit-cell shown in Fig. 1.13b [55], where an extra
transistor connects the complementary MTJs. This design enables the simultaneous
write operation and thereby decreases the cycle time and write power consumption.
An adaptive 3T3J cell structure, shown in Fig. 1.14a, was proposed in [56]. One
3T3J cell can store 2-bit data via the resistance combinations of three MTJs. In
this structure, the left part uses a 1T1J cell to store 1-bit data (Bit0), and the
right part is a 2T2J-like structure (Bit1). Two-stage sensing is adopted to read
the 3T3J cell. As can be seen in Fig. 1.14b, the 1T1J cell is sensed during the
first stage and the 2T2J cell is read during the second stage. The 2T2J cell can be
sensed faster than the 1T1J cell. Thus, with this 3T3J cell structure, the total
read latency for 2-bit data is much smaller than that of the standard 1T1J-based
cache. Moreover, this 3T3J cell reduces the area overhead compared with the
standard 2T2J cell. At runtime, the 3T3J cell can switch between the 3T and 2T
modes dynamically. If the running application is space-hungry, the 3T mode is used
to achieve the performance and area benefits. Alternatively, a
performance-demanding application activates the 2T mode, in which the performance
is comparable to that of the 2T2J cell.
Nowadays it is widely accepted that the perpendicular MTJ outperforms the in-plane
MTJ for the reasons mentioned in Sect. 1.2. Advances in nanotechnology make it
possible to develop perpendicular MTJs that qualify for higher-level caches
(e.g. L2 or L1 cache). For that purpose, the main challenge is achieving sub-5 ns
write latency with an affordable current. Recently, sub-ns STT switching has
been experimentally demonstrated in an 80-nm perpendicular MTJ [26]. In addition,
a sufficient TMR ratio should be guaranteed for the fast read operation. A
double-interface perpendicular MTJ with a TMR ratio as high as 249% has recently
been developed. Such a high TMR ratio was obtained by using an atom-thick tungsten
spacer to enhance the spin filtering [57]. Very recently, a perpendicular MTJ
showed competitive features for the NV-cache [27], such as sub-20 nm size, sub-3 ns
write latency, 150% TMR ratio, <100 µA write current and sufficient read margin.
Fig. 1.14 a 3T3J bit-cell and corresponding periphery. b Two-stage sensing waveforms for the
3T3J bit-cell [56] (annotated read latencies: 2.382 ns for the first stage and 0.395 ns for the
second stage)
Under the macrospin assumption, the critical STT current ($I_c$) and the
data-retention time ($\tau$) can be calculated by (1.4)–(1.5), taking the
perpendicular MTJ as an example. A longer retention time indicates that the data is
more immune to random bit-flips, but it also means a larger write current and
thereby higher power consumption. For the storage application, the data-retention
time must be more than 10 years over a wide range of temperatures (e.g. −10 to
70 °C for consumer applications). This requires a thermal stability barrier of
about 60–80 $k_B T$, which, however, is unnecessarily large for the cache
application. The data is maintained in the cache for a much shorter time than in
the storage devices, since the data in a cache is frequently updated (e.g. 1 s
retention is sufficient for the L3 cache [58]). Therefore, the relaxation of
nonvolatility was proposed for the STT-MRAM cache to decrease the write current and
write latency [59, 60].
$$I_c = \frac{4 \alpha e}{\hbar g} E \tag{1.4}$$
$$\tau = \tau_0 \exp\left[ \frac{E}{k_B T} \left( 1 - \frac{I}{I_c} \right) \right] \tag{1.5}$$
where $E$ is the thermal stability barrier given by (1.2), $\hbar$ is the reduced
Planck constant, $k_B$ is the Boltzmann constant, $T$ is the temperature, $\tau_0$
is the attempt period, and $I$ is the current applied to the MTJ.
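The trade-off behind the relaxation of nonvolatility can be illustrated by evaluating (1.5) numerically. The Python sketch below assumes an attempt period of 1 ns, a value commonly used in the literature but not stated in this chapter, and it considers a single isolated bit at zero applied current unless noted. The function names are hypothetical.

```python
import math


def retention_time(delta, tau0=1e-9, current_ratio=0.0):
    """Retention time from (1.5), in seconds.

    delta is the barrier in units of k_B*T, i.e. E / (k_B T);
    current_ratio is I / I_c; tau0 is the assumed attempt period.
    """
    return tau0 * math.exp(delta * (1.0 - current_ratio))


def required_delta(target_seconds, tau0=1e-9):
    """Barrier (in k_B*T) needed for a target retention time at I = 0."""
    return math.log(target_seconds / tau0)


TEN_YEARS = 10 * 365.25 * 24 * 3600  # seconds

print(f"10-year retention: E = {required_delta(TEN_YEARS):.1f} k_B T")
print(f"1-second retention: E = {required_delta(1.0):.1f} k_B T")
```

Under these assumptions, a 1 s retention target needs roughly half the barrier of a 10-year target, and since (1.4) makes the critical current proportional to E, the write current scales down accordingly. The single-bit estimate for 10 years is lower than the 60–80 $k_B T$ quoted above, which presumably includes margins for large arrays and for the temperature range.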
In this section, we briefly introduce two emerging spintronics technologies which
are expected to bring significant improvements over the STT-MRAM. These
technologies are still in their infancy, and there is a long way to go before they
can be used to design a real high-performance cache. Nevertheless, several advanced
studies have shown their great application potential.
18 Z. Wang et al.
Despite massive progress, the STT-MRAM cache has to tolerate the following intrinsic
drawbacks. First, the read and write operations share the same path, which makes it
difficult to optimize read and write performance at the same time. Second, the
STT is expressed as (1.6): it is the thermal fluctuation that induces a small angle
between $\mathbf{m}$ and $\mathbf{m}_p$ and thus triggers the STT. This causes an
incubation delay that limits the STT switching speed. These drawbacks have to be
compensated through cell-level or architecture-level strategies when designing an
STT-MRAM cache.
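$$\boldsymbol{\tau}_{STT} \propto \lambda_{DL} J_{STT}\, \mathbf{m} \times (\mathbf{m} \times \mathbf{m}_p) + \lambda_{FL} J_{STT}\, \mathbf{m} \times \mathbf{m}_p \qquad (1.6)$$

where $J_{STT}$ is the STT current density, and $\mathbf{m}$ and $\mathbf{m}_p$ are the unit magnetization vectors of the free layer and pinned layer, respectively. $\lambda_{DL}$ and $\lambda_{FL}$ are the coefficients describing the strengths of the damping-like and field-like torques, respectively.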
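The incubation delay can be read directly from the geometry of (1.6): the damping-like term m × (m × m_p) has magnitude sin θ, where θ is the angle between m and m_p, so it vanishes when the two vectors are collinear. The following is a minimal sketch in plain Python; the helper names and the choice of m_p along +z are illustrative assumptions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in v))


def damping_like_magnitude(theta_deg):
    """|m x (m x m_p)| for a free layer tilted by theta from the pinned layer."""
    theta = math.radians(theta_deg)
    m = (math.sin(theta), 0.0, math.cos(theta))  # free-layer unit vector
    m_p = (0.0, 0.0, 1.0)                        # pinned layer, assumed along +z
    return norm(cross(m, cross(m, m_p)))


for angle in (0.0, 1.0, 5.0, 90.0):
    print(f"theta = {angle:5.1f} deg -> |torque term| = {damping_like_magnitude(angle):.4f}")
```

At θ = 0 the term is exactly zero, so switching has to wait until thermal fluctuations open a small angle. A geometry that starts with the magnetization perpendicular to the spin polarization begins at the maximum of sin θ instead.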
The two drawbacks of the STT described above can be overcome by an emerging mechanism
called spin orbit torque (SOT) [63–66]. The schematic structure of the SOT-MTJ is
shown in Fig. 1.15, where an MTJ is deposited on a heavy-metal strip. A current
passing through the heavy-metal strip induces the SOT, which drives the magnetization
switching of the free layer. The SOT may originate from the spin Hall effect or the
Rashba effect, but the relative contribution of these two effects cannot be easily
determined. The SOT can be expressed as
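$$\boldsymbol{\tau}_{SOT} \propto \lambda_{DL} J_{SOT}\, \mathbf{m} \times (\mathbf{m} \times \boldsymbol{\sigma}) + \lambda_{FL} J_{SOT}\, \mathbf{m} \times \boldsymbol{\sigma} \qquad (1.7)$$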
where $J_{SOT}$ is the SOT current density and $\boldsymbol{\sigma}$ is the unit polarization vector of the SOT-induced spin injection. For the in-plane MTJ, $\mathbf{m}$ is nearly parallel to $\boldsymbol{\sigma}$, so the situation is similar to that of the STT and the incubation delay is not eliminated. Nevertheless, the efficiency of the SOT is much higher than that of the STT for the following reason. Considering the spin Hall effect, the ratio of spin current to charge current can be made larger than 1 by adjusting the lateral area of the heavy metal [see (1.8)], whereas it is smaller than 1 in the case of the STT-MTJ. In addition, the SOT current can be increased further to improve the switching speed, since it passes through the heavy-metal strip without the risk of barrier breakdown.

$$\frac{I_S}{I_C} = \theta_{SH}\,\frac{S_{MTJ}}{S_{HM}} \qquad (1.8)$$

where $I_S$ and $I_C$ are the spin current and charge current, respectively, $S_{MTJ}$ and $S_{HM}$ are the MTJ planar area and heavy-metal lateral area, respectively, and $\theta_{SH}$ is the spin Hall angle.

Fig. 1.15 Schematic structures of the spin orbit torque (SOT) based devices. Panels a and b describe the SOT mechanisms from the viewpoints of the spin Hall effect and the Rashba effect, respectively [66]
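As a rough numerical illustration of (1.8), the sketch below computes the spin-to-charge current ratio for one geometry. The 40 nm MTJ diameter and the 5 nm × 40 nm heavy-metal cross-section echo dimensions quoted elsewhere in this section. The spin Hall angle of 0.3, the circular MTJ shape and the function name are assumptions made only for this example.

```python
import math


def spin_to_charge_ratio(theta_sh, mtj_area_nm2, hm_lateral_area_nm2):
    """I_S / I_C following (1.8): spin Hall angle times the area ratio."""
    return theta_sh * mtj_area_nm2 / hm_lateral_area_nm2


# Illustrative values; theta_sh and the circular shape are assumptions.
theta_sh = 0.3                         # assumed spin Hall angle
mtj_area = math.pi * (40.0 / 2) ** 2   # circular MTJ, 40 nm diameter, in nm^2
hm_area = 5.0 * 40.0                   # heavy-metal cross-section, 5 nm x 40 nm

ratio = spin_to_charge_ratio(theta_sh, mtj_area, hm_area)
print(f"I_S / I_C = {ratio:.2f}")
```

With these numbers the ratio exceeds 1 even though the spin Hall angle itself is well below 1, because the MTJ footprint is several times larger than the cross-section the charge current flows through.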
For the perpendicular MTJ, $\mathbf{m}$ is nearly perpendicular to $\boldsymbol{\sigma}$, so the SOT is much stronger and the incubation delay can be eliminated. Recently, ultrafast sub-ns magnetization switching has been experimentally demonstrated with a perpendicular SOT-MTJ [67], which is comparable to the write speed of the SRAM. The SOT write current density is $10^{11}$ to $10^{12}$ A/m². For a heavy-metal strip with a 5 × 40 nm² lateral area, the SOT write current is 20–200 µA, comparable to that of a 40 nm perpendicular STT-MTJ. Therefore, the perpendicular SOT-MTJ is a competitive candidate for high-level cache (e.g. L1 cache).
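The quoted current range follows from multiplying the current density by the heavy-metal cross-section. A short check of that arithmetic, with an illustrative function name:

```python
def sot_write_current_uA(current_density_A_per_m2, width_nm, thickness_nm):
    """Write current in microamperes for a rectangular heavy-metal cross-section."""
    area_m2 = (width_nm * 1e-9) * (thickness_nm * 1e-9)
    return current_density_A_per_m2 * area_m2 * 1e6


# Current densities and strip dimensions are the ones quoted in the text.
for j in (1e11, 1e12):
    current = sot_write_current_uA(j, width_nm=40, thickness_nm=5)
    print(f"J = {j:.0e} A/m^2 -> I = {current:.0f} uA")
```

The 200 nm² cross-section gives 20 µA at the lower density and 200 µA at the upper one, matching the range stated above.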
Recently, some simulation studies have evaluated the potential of the SOT-MRAM cache [68, 69, 1]. A typical bit-cell array is shown in Fig. 1.16, where each SOT-MTJ is connected to two access transistors, one for read and one for write operations. The sensing amplifier reads the MTJ state by comparing the read current with a reference value. The write driver generates bidirectional currents that flow through the heavy-metal strip and switch the magnetization of the free layer. System-level simulation results demonstrate that the write performance of the cache is significantly improved by replacing the STT-MRAM with SOT-MRAM.

Fig. 1.17 Schematic structure of the NAND-like spin orbit torque device [70, 71]. The storage density is improved by sharing the SOT current
However, SOT-MRAM also has shortcomings that need to be overcome. First, the SOT-MRAM cell requires two access transistors, which incurs an area penalty compared with the 1T1J STT-MRAM cell. Recently, the NAND-like SOT device shown in Fig. 1.17 has been proposed to alleviate this issue by sharing the SOT current [70, 71]. Second, the heavy-metal strip causes a loss of TMR ratio because the read current has to pass through it. A higher TMR ratio is therefore needed for high-speed and high-reliability read operation. Third, for the perpendicular SOT-MRAM, an additional magnetic field is required for deterministic switching, which hinders the practical realization of the SOT-MRAM cache. A switching mechanism called spin-Hall-assisted STT (SHA-STT) has been proposed for field-free write operation of the perpendicular SOT-MRAM [72, 73]. The SOT is used to eliminate the incubation delay, while the STT determines the polarity of the write operation. The simulation results shown in Fig. 1.18 validate the ultrafast deterministic switching induced by SHA-STT. A system-level evaluation was also performed to validate the performance improvement of the SHA-STT-MRAM cache [74], as shown in Fig. 1.19. Recently, the interplay between SOT and STT has been experimentally demonstrated in a three-terminal MTJ [75], which validates the feasibility of the SHA-STT. Based on these experimental results, a novel memory called toggle spin torques (TST) MRAM has also been proposed for upper-level caches [76]. In addition, recent experiments demonstrate that field-free SOT-induced magnetization switching can be achieved with ferromagnet/antiferromagnet bilayers, where an exchange bias replaces the required magnetic field [77, 78]. This observation offers another route to an ultrafast SOT-MRAM cache.
Fig. 1.18 Typical trajectories of the magnetization driven by a the STT and b the SHA-STT. Here the magnetization is switched from the –z to the +z direction. It is clearly seen that the incubation delay can be eliminated by the SHA-STT
Fig. 1.19 Architecture-level results of performance comparison amongst the SRAM, STT-MRAM and SHA-STT-MRAM caches [74]. Here the results are normalized to the SRAM cache
[Plots not reproduced. Recoverable panel titles: area, read energy, leakage power, write energy and write latency; the horizontal-axis ticks run from 16K to 16M.]
motes or represses the interfacial PMA depending on the polarity of the applied
voltage. This mechanism can be modeled by (1.9)–(1.10). From the viewpoint of the energy barrier, a positive voltage lowers and a negative voltage raises the energy barrier for magnetization switching. Two regimes can be identified depending on the amplitude of the applied voltage. (i) If the positive voltage is large enough to fully eliminate the energy barrier, the magnetization of the free layer becomes unstable and precesses back and forth between the upward and downward directions; this is the precessional regime. (ii) Otherwise, the energy barrier is not fully eliminated, and thermal activation, a magnetic field or STT is required to switch the magnetization of the free layer; this is the thermal-activation regime. Unlike the STT or SOT, VCMA induces the magnetization switching through a voltage instead of a current, so the write power can be significantly decreased.

Fig. 1.20 a Schematic of a VCMA-MTJ device; b illustration of the impacts of various bias voltages ($V_b < 0$, $V_b = 0$, $V_b < V_c$, $V_b = V_c$, $V_b > V_c$) on the energy barrier of a VCMA-MTJ device [83]
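The two regimes can be summarized in a few lines of code. The sketch below is only illustrative: it assumes the barrier falls linearly with bias and vanishes at a critical voltage Vc, which is a simplification and not the chapter's model (1.9)–(1.10). The 40 kB T zero-bias barrier and Vc = 1.0 V are hypothetical values.

```python
def vcma_barrier(e_b0, v_b, v_c):
    """Energy barrier under bias v_b, ASSUMING a linear voltage dependence.

    e_b0 is the zero-bias barrier and v_c the voltage that removes it.
    Negative v_b raises the barrier; the result is clipped at zero.
    """
    return max(0.0, e_b0 * (1.0 - v_b / v_c))


def vcma_regime(v_b, v_c):
    """Classify the switching regime described in the text."""
    if v_b >= v_c:
        return "precessional"        # barrier fully eliminated
    return "thermal-activation"      # barrier remains; needs STT, field or heat


E_B0 = 40.0  # hypothetical zero-bias barrier, in units of kB*T
V_C = 1.0    # hypothetical critical voltage, in volts

for v in (-0.5, 0.0, 0.8, 1.0, 1.2):
    print(f"Vb = {v:+.1f} V: barrier = {vcma_barrier(E_B0, v, V_C):5.1f} kBT, "
          f"regime = {vcma_regime(v, V_C)}")
```

The five bias points mirror the cases sketched in Fig. 1.20b: a negative bias raises the barrier, a sub-critical positive bias lowers it, and a bias at or above Vc removes it.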
Fig. 1.21 Time-resolved evolutions of the magnetization ($m_z$) of the free layer in the presence of the VCMA effect [83]. The VCMA-MTJ switching operates in a the precessional regime; and b the thermal-activation regime with STT assistance
the write latency of the VCMA-MRAM is competitive enough to replace the SRAM. Like
the other types of MRAM, the VCMA-MRAM also retains the advantages in area, read
energy/latency, and leakage energy over the SRAM.
In this chapter, we reviewed the efforts towards the MRAM-based NV cache. Among the various types of MRAM, the STT-MRAM attracts far more research interest than the others. Both commercial products and experimental prototypes of the STT-MRAM have been demonstrated, and both standalone and embedded applications have been explored. These advances encourage researchers to develop the STT-MRAM-based cache. However, this goal is hindered by the fact that the write performance of the STT-MRAM is poorer than that of the conventional SRAM. To overcome this weakness, researchers from physics, electronics and computer science have proposed a large number of optimization strategies at the device, cell, circuit and architecture levels. These works were summarized in the main body of this chapter. Another route to a high-performance MRAM cache is to change the mechanism of the magnetization switching itself. The recently proposed SOT and VCMA have shown promising potential for high-speed, low-power MRAM. Nevertheless, many intrinsic difficulties at the device level need to be solved before they can be used to design the NV-Cache. In addition, other spintronics concepts such as domain-wall racetrack memory and skyrmions have also been explored for the design of the NV-Cache [85–90], although they are not covered in this chapter. We believe that the above technologies will coexist for a long time during the exploration of the MRAM-based cache.
Acknowledgements This work was supported by the National Natural Science Foundation of
China (61704005, 61501013 and 61571023), the National Key Technology Program of China
(2017ZX01032101), and the International Mobility Project (B16001 and 2015DFE12880).
References
19. S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma, H.D. Gan, M. Endo, S. Kanai, J. Hayakawa,
F. Matsukura, H. Ohno, A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction.
Nat. Mater. 9(9), 721–724 (2010)
20. M. Hosomi, et al., A novel nonvolatile memory with spin torque transfer magnetization switch-
ing: spin-RAM, in IEEE-IEDM (2005), pp. 459–462
21. A. Maashri, G. Sun, X. Dong, V. Narayanan, Y. Xie, 3D GPU architecture using cache stacking:
Performance, cost, power and thermal analysis, in IEEE-ICCD (2009), pp. 254–259
22. G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, A novel architecture of the 3D stacked MRAM L2
cache for CMPs, in IEEE-HPCA (2009), pp. 239–249
23. X. Dong, et al., Circuit and microarchitecture evaluation of 3D stacking magnetic RAM
(MRAM) as a universal memory replacement, in ACM/IEEE DAC (2008), pp. 554–559
24. G. Jan, et al., Demonstration of fully functional 8 Mb perpendicular STT-MRAM chips with
sub-5 ns writing for non-volatile embedded memories, in IEEE Symposium on VLSI Technology
(2014), pp. 42–43
25. D. Saida, et al., Sub-3 ns pulse with sub-100 µA switching of 1x–2x nm perpendicular MTJ
for high-performance embedded STT-MRAM towards sub-20 nm CMOS, in IEEE Symposium
on VLSI Technology (2016), pp. 1–2
26. G. Jan, et al., Achieving sub-ns switching of STT-MRAM for future embedded LLC appli-
cations through improvement of nucleation and propagation switching mechanisms, in IEEE
Symposium on VLSI Technology (2016), pp. 1–2
27. D. Saida, S. Kashiwada, M. Yakabe, T. Daibou, M. Fukumoto, S. Miwa, Y. Suzuki, K. Abe,
H. Noguchi, J. Ito, S. Fujita, 1x–2x nm perpendicular MTJ switching at sub-3-ns pulses below
100 µA for high-performance embedded STT-MRAM for sub-20-nm CMOS. IEEE Trans.
Electron Devices 64(2), 427–431 (2017)
28. P. Zhou, B. Zhao, J. Yang, Y. Zhang, Energy reduction for STT-RAM using early write termi-
nation, in IEEE/ACM ICCAD (2009), pp. 264–268
29. J. Wang, X. Dong, Y. Xie, OAP: an obstruction-aware cache management policy for STT-RAM
last-level caches, in DATE (2013), pp. 847–852
30. C.J. Lin, et al., 45 nm low power CMOS logic compatible embedded STT MRAM utilizing a
reverse-connection 1T/1MTJ cell, in IEEE-IEDM (2009), pp. 1–4
31. K. Ikegami, et al., Low power and high density STT-MRAM for embedded cache memory
using advanced perpendicular MTJ integrations and asymmetric compensation techniques, in
IEEE-IEDM (2014), pp. 28.1.1–28.1.4
32. G. Sun, Y. Zhang, Y. Wang, Y. Chen, Improving energy efficiency of write-asymmetric mem-
ories by log style write, in ISLPED (2012), pp. 173–178
33. X. Wu, J. Li, L. Zhang, E. Speight, Y. Xie, Power and performance of read-write aware hybrid
caches with non-volatile memories, in DATE (2009), pp. 737–742
34. J. Li, C. Xue, Y. Xu, STT-RAM based energy-efficiency hybrid cache for CMPs, in IEEE/IFIP
VLSI-SoC (2011), pp. 31–36
35. K. Qiu, M. Zhao, Q. Li, C. Fu, C. Xue, Migration-aware loop retiming for STT-RAM-based
hybrid cache in embedded systems. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
33(3), 329–342 (2014)
36. A. Sharifi, M. Kandemir, Automatic feedback control of shared hybrid caches in 3D chip
multiprocessors, in International Euromicro Conference on PDP (2011), pp. 393–400
37. B. Wu, Y. Cheng, J. Yang, A. Todri-Sanial, W. Zhao, Temperature impact analysis and access
reliability enhancement for 1T1MTJ STT-RAM. IEEE Trans. Reliab. 65(4), 1755–1768 (2016)
38. B. Wu, et al., Thermosiphon: a thermal aware NUCA architecture for write energy reduction
of the STT-MRAM based LLCs, in IEEE/ACM ICCAD (2017), pp. 474–481
39. C. Kim, D. Burger, S.W. Keckler, An Adaptive, non-uniform cache structure for wire-delay
dominated on-chip caches, in ACM-ASPLOS (2002), pp. 211–222
40. W. Zhao et al., Failure and reliability analysis of STT-MRAM. Microelectron. Reliab. 52(9–10),
1848–1852 (2011)
41. D. Zhang, L. Zeng, T. Gao, F. Gong, X. Qin, W. Kang, Y. Zhang, Y. Zhang, J. Klein, W.
Zhao, Reliability-enhanced separated pre-charge sensing amplifier for hybrid CMOS/MTJ
logic circuits. IEEE Trans. Magn. 53(9), 1–5 (2017)
42. H. Zhang, W. Kang, T. Pang, W. Lv, Y. Zhang, W. Zhao, Dual reference sensing scheme with
triple steady states for deeply scaled STT-MRAM, in IEEE/ACM NANOARCH (2016), pp. 1–6
43. L. Zhang, et al., Channel modeling and reliability enhancement design techniques for STT-
MRAM, in ISVLSI (2015), pp. 461–466
44. M. McCartney, SRAM reliability improvement using ECC and circuit techniques, Ph.D. thesis
(2014)
45. X. Wang, M. Mao, E. Eken, W. Wen, H. Li, Y. Chen, Sliding basket: an adaptive ECC scheme
for runtime write failure suppression of STT-RAM cache, in DATE (2016), pp. 762–767
46. X. Dong, C. Xu, Y. Xie, N. Jouppi, NVSim: a circuit-level performance, energy, and area
model for emerging nonvolatile memory. IEEE Trans. Comput. Aided Des. Integr. Circuits
Syst. 31(7), 994–1007 (2012)
47. S. Wilton, N. Jouppi, CACTI: an enhanced cache access and cycle time model. IEEE J. Solid-
State Circuits 31(5), 677–688 (1996)
48. B. Wu, et al., An architecture-level cache simulation framework supporting advanced PMA
STT-MRAM, in IEEE/ACM NANOARCH (2015), pp. 7–12
49. E. Eken, et al., NVSim-VXs: an improved NVSim for variation aware STT-RAM simulation,
in ACM/EDAC/IEEE-DAC (2016), pp. 1–6
50. K. Abe, et al., Novel hybrid DRAM/MRAM design for reducing power of high performance
mobile CPU, in IEEE-IEDM (2012), pp. 10.5.1–10.5.4
51. S. Yamamoto, S. Sugahara, Nonvolatile static random access memory using magnetic tunnel
junctions with current-induced magnetization switching architecture. Jpn. J. Appl. Phys. 48(4),
043001 (2009)
52. T. Ohsawa, et al., 1 Mb 4T-2MTJ nonvolatile STT-RAM for embedded memories using 32b
fine-grained power gating technique with 1.0 ns/200 ps wake-up/power-off times, in Symposium
on VLSIC (2012), pp. 46–47
53. H. Noguchi, et al., A 250-MHz 256b-I/O 1-Mb STT-MRAM with advanced perpendicular
MTJ based dual cell for nonvolatile magnetic caches to reduce active power of processors, in
Symposium on VLSI Technology (2013), pp. 108–109
54. H. Noguchi, et al., Highly reliable and low-power nonvolatile cache memory with advanced per-
pendicular STT-MRAM for high-performance CPU, in Symposium on VLSIC (2014), pp. 1–2
55. A. Kawasumi, et al., Circuit techniques in realizing voltage-generator-less STT MRAM suitable
for normally-off-type non-volatile L2 cache memory, in IEEE-IMW (2013), pp. 76–79
56. L. Xue, B. Wu, B. Zhang, Y. Cheng, P. Wang, C. Park, J. Kan, S. Kang, Y. Xie, An adaptive
3T-3MTJ memory cell design for STT-MRAM-based LLCs. IEEE Trans. Very Large Scale
Integr. (VLSI) Syst. 26(3), 484–495 (2018)
57. M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang, J.
Langer, B. Ocker, A. Fert, W. Zhao, Current-induced magnetization switching in atom-thick
tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresis-
tance. Nat. Commun. 9(1) (2018)
58. K. Ikegami, et al., MTJ-based ‘normally-off processors’ with thermal stability factor engineered
perpendicular MTJ L2 cache based on 2T-2MTJ cell L3 and last level cache based on 1T-1MTJ
cell and novel error handling scheme, in IEEE-IEDM (2015), pp. 25.1.1–25.1.4
59. C.W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, M.R. Stan, Relaxing non-volatility for
fast and energy-efficient STT-RAM caches, in IEEE-HPCA (2011), pp. 50–61
60. H. Li, X. Wang, Z. Ong, W. Wong, Y. Zhang, P. Wang, Y. Chen, Performance, power, and
reliability tradeoffs of STT-RAM cell subject to architecture-level requirement. IEEE Trans.
Magn. 47(10), 2356–2359 (2011)
61. A. Jog, et al., Cache revive: architecting volatile STT-RAM caches for enhanced performance
in CMPs, in DAC (2012), pp. 243–252
62. Z. Sun, et al., Multi retention level STT-RAM cache designs with a dynamic refresh scheme,
in IEEE/ACM MICRO (2011), pp. 329–338
63. I. Miron, K. Garello, G. Gaudin, P. Zermatten, M. Costache, S. Auffret, S. Bandiera, B. Rod-
macq, A. Schuhl, P. Gambardella, Perpendicular switching of a single ferromagnetic layer
induced by in-plane current injection. Nature 476(7359), 189–193 (2011)
64. L. Liu, C. Pai, Y. Li, H. Tseng, D. Ralph, R. Buhrman, Spin-torque switching with the giant
spin Hall effect of tantalum. Science 336(6081), 555–558 (2012)
65. M. Cubukcu, O. Boulle, M. Drouard, K. Garello, C. Onur Avci, I. Mihai Miron, J. Langer,
B. Ocker, P. Gambardella, G. Gaudin, Spin-orbit torque magnetization switching of a three-
terminal perpendicular magnetic tunnel junction. Appl. Phys. Lett. 104(4), 042406 (2014)
66. Z. Wang, Z. Li, Y. Liu, S. Li, L. Chang, W. Kang, Y. Zhang, W. Zhao, Progresses and challenges
of spin orbit torque driven magnetization switching and application, in IEEE-ISCAS (2018)
67. M. Cubukcu, O. Boulle, N. Mikuszeit, C. Hamelin, T. Bracher, N. Lamard, M. Cyrille, L.
Buda-Prejbeanu, K. Garello, I. Miron, O. Klein, G. de Loubens, V. Naletov, J. Langer, B.
Ocker, P. Gambardella, G. Gaudin, Ultra-fast perpendicular spin-orbit torque MRAM. IEEE
Trans. Magn. 54(4), 1–4 (2018)
68. J. Kim, et al., Spin-Hall effect MRAM based cache memory: a feasibility study, in DRC (2015),
pp. 117–118
69. R. Bishnoi, M. Ebrahimi, F. Oboril, M.B. Tahoori, Architectural aspects in design and analysis
of SOT-based memories, in ASP-DAC (2014), pp. 700–707
70. Z. Wang, L. Zhang, M. Wang, Z. Wang, D. Zhu, Y. Zhang, W. Zhao, High-density NAND-like
spin transfer torque memory with spin orbit torque erase operation. IEEE Electron Device Lett.
39(3), 343–346 (2018)
71. H. Yoda, et al., Voltage-control spintronics memory (VoCSM) having potentials of ultra-low
energy-consumption and high-density, in IEEE-IEDM (2016), pp. 27.6.1–27.6.4
72. Z. Wang, W. Zhao, E. Deng, J. Klein, C. Chappert, Perpendicular-anisotropy magnetic tunnel
junction switched by spin-Hall-assisted spin-transfer torque. J. Phys. D Appl. Phys. 48(6),
045001 (2015)
73. A. van den Brink, S. Cosemans, S. Cornelissen, M. Manfrini, A. Vaysset, W. Van Roy, T. Min,
H. Swagten, B. Koopmans, Spin-Hall-assisted magnetic random access memory. Appl. Phys.
Lett. 104(1), 012403 (2014)
74. L. Chang, et al., Evaluation of spin-Hall-assisted STT-MRAM for cache replacement, in
IEEE/ACM NANOARCH (2016), pp. 73–78
75. M. Wang et al., Field-free switching of a perpendicular magnetic tunnel junction through the
interplay of spin–orbit and spin-transfer torques. Nat. Electron. 1, 582–588 (2018)
76. Z. Wang et al., Proposal of Toggle Spin Torques Magnetic RAM for Ultrafast Computing.
IEEE Electron Device Lett. 40(5), 726–729 (2019)
77. S. Fukami, C. Zhang, S. DuttaGupta, A. Kurenkov, H. Ohno, Magnetization switching
by spin–orbit torque in an antiferromagnet–ferromagnet bilayer system. Nat. Mater. 15(5),
535–541 (2016)
78. Y. Oh, S. Chris Baek, Y. Kim, H. Lee, K. Lee, C. Yang, E. Park, K. Lee, K. Kim, G. Go, J.
Jeong, B. Min, H. Lee, K. Lee, B. Park, Field-free switching of perpendicular magnetization
through spin–orbit torque in antiferromagnet/ferromagnet/oxide structures. Nat. Nanotechnol.
11(10), 878–884 (2016)
79. W. Wang, M. Li, S. Hageman, C. Chien, Electric-field-assisted switching in magnetic tunnel
junctions. Nat. Mater. 11(1), 64–68 (2011)
80. J.G. Alzate, et al., Voltage-induced switching of nanoscale magnetic tunnel junctions, in IEEE-
IEDM (2012), pp. 29.5.1–29.5.4
81. K. Wang, H. Lee, P. Khalili Amiri, Magnetoelectric random access memory-based circuit design
by using voltage-controlled magnetic anisotropy in magnetic tunnel junctions. IEEE Trans.
Nanotechnol. 14(6), 992–997 (2015)
82. W. Kang, Y. Ran, Y. Zhang, W. Lv, W. Zhao, Modeling and exploration of the voltage-controlled
magnetic anisotropy effect for the next-generation low-power and high-speed MRAM appli-
cations. IEEE Trans. Nanotechnol. 16(3), 387–395 (2017)
83. W. Kang, L. Chang, Y. Zhang, W. Zhao, Voltage-controlled MRAM for working memory:
perspectives and challenges, in DATE (2017), pp. 542–547
84. S. Kanai, Y. Nakatani, M. Yamanouchi, S. Ikeda, H. Sato, F. Matsukura, H. Ohno, Magnetization
switching in a CoFeB/MgO magnetic tunnel junction by combining spin-transfer torque and
electric field-effect. Appl. Phys. Lett. 104(21), 212406 (2014)
85. H. Xu, Y. Li, R. Melhem, A.K. Jones, Multilane racetrack caches: improving efficiency through
compression and independent shifting, in ASP-DAC (2015), pp. 417–422
86. X. Zhang, L. Zhao, Y. Zhang, J. Yang, Exploit common source-line to construct energy efficient
domain wall memory based caches, in IEEE-ICCD (2015), pp. 157–163
87. R. Venkatesan et al., Cache design with domain wall memory. IEEE Trans. Comput. 65(4),
1010–1024 (2016)
88. W. Kang, C. Zheng, Y. Huang, X. Zhang, W. Lv, Y. Zhou, W. Zhao, Compact modeling and
evaluation of magnetic skyrmion-based racetrack memory. IEEE Trans. Electron Devices 64(3),
1060–1068 (2017)
89. W. Kang, Y. Huang, X. Zhang, Y. Zhou, W. Zhao, Skyrmion-electronics: an overview and
outlook. Proc. IEEE 104(10), 2040–2061 (2016)
90. F. Chen et al., Process variation aware data management for magnetic skyrmions racetrack
memory, in ASP-DAC (2018), pp. 221–226
Chapter 2
CMOS-OxRAM Based Hybrid
Nonvolatile SRAM and Flip-Flop:
Circuit Implementations
Abstract A critical technological challenge over the past few decades has been to achieve low-power operation without sacrificing performance. This led to the development of computing units that are normally turned off when not in use and turned on instantly with full performance when required, thereby helping to eliminate leakage power. However, with direct power-down, the states in local memories (SRAM) and volatile registers (SRAM-based flip-flops) are lost. Thus, to support a power-down mode in SRAM-based memories and flip-flops (FFs), the data states are off-loaded to an external nonvolatile storage array, giving rise to NV-SRAM/NV-FF circuits (i.e. nonvolatile SRAM/nonvolatile flip-flop). In this chapter, we present a real-time 4T-2R NV-SRAM bitcell using HfOx-based OxRAM (oxide-based random access memory) devices. We discuss the working principle, programming methodologies and stability of the NV-SRAM bitcell. We further present a novel NV-FF design based on the 4T-2R NV-SRAM bitcell and provide an insight into its working and operating modes.
2.1 Introduction
With the advent of technologies such as wireless sensors, bio-medical implants and the internet-of-things (IoT), ultra-low-power operation and a “normally-off, instant power-on” mode have become an absolute necessity [1–3]. These systems wake up only sporadically, so leakage dominates their power consumption. To minimize the leakage power, a power-gating approach has been proposed, in which a lower voltage (hold voltage) is applied to the volatile memory to retain data while all logic circuits are turned off [4]. However, even maintaining this hold voltage during power-down mode in high-performance processing units leads to a large power dissipation due to leakage current, which is ≈40% of the dynamic energy [5]. Even worse, during abrupt power failures the data in volatile memory are lost and computation tasks have to be restarted. This happens because of the volatile nature of the CMOS memory cells used in conventional CPUs, such as SRAM-based caches and flip-flop (FF) based register files. To mitigate these issues, different circuits have been designed to back up data from on-chip memory (SRAM), FFs and registers to off-chip nonvolatile memory (NVM), thus preserving the system state in case of power failures. This is known as the two-macro scheme, i.e. SRAM (for faster access) in conjunction with NVM (for nonvolatility). The main drawback of this methodology is that it requires a long store/restore time because of the serial SRAM read/write and the long NVM write/read procedures, which results in a long power-on/off time. Thus, the two-macro scheme is vulnerable to data loss in case of sudden power failure [6, 7]. To address these limitations, NVM elements are directly integrated into the SRAM or FF units, forming a direct bit-to-bit connection in a vertical arrangement to achieve faster parallel data transfer and turn-on/off speed. This gives rise to NV-SRAM/NV-FF units.
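The difference between the serial two-macro backup and the bit-to-bit parallel backup can be illustrated with a first-order timing model. This is a sketch under simplifying assumptions: the word count and the per-operation latencies are hypothetical, and peak-current limits that might force a real NV-SRAM array to back up in several batches are ignored.

```python
def two_macro_store_time_ns(n_words, t_sram_read_ns, t_nvm_write_ns):
    """Serial backup: each word is read from SRAM, then written to the NVM macro."""
    return n_words * (t_sram_read_ns + t_nvm_write_ns)


def nv_sram_store_time_ns(t_nvm_write_ns):
    """Parallel backup: every bitcell writes its own NVM element at the same time."""
    return t_nvm_write_ns


# Hypothetical numbers chosen only for illustration.
N_WORDS = 1024
T_SRAM_READ_NS = 1.0
T_NVM_WRITE_NS = 10.0

serial = two_macro_store_time_ns(N_WORDS, T_SRAM_READ_NS, T_NVM_WRITE_NS)
parallel = nv_sram_store_time_ns(T_NVM_WRITE_NS)
print(f"two-macro store: {serial:.0f} ns, NV-SRAM store: {parallel:.0f} ns")
```

In this model the serial store time grows linearly with the number of words, while the parallel store time stays at a single NVM write. This is why the two-macro scheme is more exposed to sudden power loss.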
NVMs such as floating-gate based memories, PCM (Phase Change Memory), FRAM (Ferroelectric RAM), OxRAM (Oxide-based RAM), CBRAM (Conductive Bridge RAM) and STT-MRAM (Spin Transfer Torque based Magnetoresistive RAM) have emerged as promising solutions for realizing embedded nonvolatile logic. However, due to their large access/programming times, high operating voltages and limited endurance, floating-gate or FLASH memories are less favored choices. PCM devices, on the other hand, require a large current to heat the GST material for resistive switching between the crystalline and amorphous states. FRAM poses a number of challenges owing to data signal degradation as devices are scaled. STT-MRAM also needs large programming currents to exert a spin torque on the magnetic moment of the free layer with respect to the fixed layer, which leads to higher power dissipation during the programming phase. As a result, OxRAM devices have emerged as a strong choice for hybrid CMOS-NVM based nonvolatile circuits owing to their low cost, high density, low operating voltages, negligible leakage, access times about 1000× faster than floating-gate memories, full CMOS compatibility, and the possibility of 3D integration and integration in vias [8–11]. In this chapter, we present the CMOS-OxRAM counterparts of the most important conventional volatile memory circuits, i.e. (i) NV-SRAM and (ii) NV-FF. These hybrid nonvolatile circuits offer advantages such as (i) nearly zero leakage, (ii) efficient backup/restore operation and (iii) high performance and low energy. We present a 4T-2R NV-SRAM bitcell that offers “real-time nonvolatility”. Using this 4T-2R NV-SRAM bitcell, we propose a novel NV-FF design. This chapter is organized as follows: Sect. 2.2 summarizes the different NV-SRAM/NV-FF implementations proposed in the literature so far. Section 2.3 discusses our 4T-2R NV-SRAM bitcell and explains its different programming schemes. Section 2.4 shows our novel real-time NV-FF implementation using the 4T-2R NV-SRAM bitcell and presents its operating modes. We also present a modified NV-FF design that offers better system performance than the aforementioned NV-FF design. The chapter concludes with Sect. 2.5.
Memory architectures use a hierarchy of caches (L1, L2, last level cache (LLC), etc.), and the optimization target for each cache level is different. L1 is accessed very frequently and therefore needs higher speed and write endurance, whereas the LLC is intended to minimize off-chip accesses and therefore needs a large capacity. Hence, an SRAM-based L1 cache is recommended for better performance [12], whereas emerging NVMs can be used in L2 or LLC (owing to their latency, density and write endurance values). To realize a nonvolatile cache using NVM, researchers have proposed various bitcell-based optimization schemes. One proposal is to use NV-SRAM (comprising a volatile and a nonvolatile circuit) for the nonvolatile cache implementation. Under normal operation (when external power is supplied), the volatile circuit provides fast data access. When controlled power-down/sleep mode is enabled or there is a sudden power failure, the nonvolatile circuit provides data backup, thereby retaining the data previously stored in the volatile circuit. In the literature, several different hybrid (CMOS-OxRAM/CMOS-MTJ (magnetic tunnel junction)/CMOS-PCM) NV-SRAM designs such as 9T-2R [13], 8T-2R [7, 14, 15], 8T-2MTJ [16], 8T-1R [17], 7T-2R [18, 19], 7T-1R [20, 21], 6T-2R [22], 6T-2MTJ [23], 4T-2R [24, 25] and 4T-2MTJ [26, 27] have been proposed. Figure 2.1 shows the circuit schematics of different NV-SRAM implementations. These implementations differ in their approach to storing data during power-down mode. Xue et al. [13] proposed a 9T-2R NV-SRAM bitcell in which an equalization transistor connected between the storage nodes is used for the data-restore mode. However, the area requirement of 9T-2R is ≈230F², compared with ≈140F² for the conventional 6T SRAM. Furthermore, separate wordlines (WLs) are required for the storage nodes, which increases the number of control signals and leads to routing congestion. Chiu et al. [7, 14] proposed an 8T-2R bitcell for better density than 9T-2R. This bitcell offered a BL-CL (bitline-control line) sharing scheme to reduce the area overhead and also enabled a write-assist function. However, the drawback of this implementation is that it requires extra control lines for off-loading the data in power-down mode. To minimize the leakage currents, Tasson et al. [17] proposed an 8T-1R NV-SRAM bitcell. The restore time of 8T-1R is ≈2.6× that of 8T-2R [7] because of the multiple steps involved in its operation, and its read latency is higher than that of 9T-2R [13], 6T-2R [22] and the conventional 6T SRAM.
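The division of labour between the volatile and nonvolatile parts of an NV-SRAM cell, as described above, can be captured in a small behavioural model. This is a purely illustrative sketch: the class and method names are hypothetical, and the model abstracts away the transistor count, the resistive states of the NVM elements and all timing.

```python
class NVSRAMCell:
    """Behavioural model of one NV-SRAM bit: a volatile latch plus an NVM backup."""

    def __init__(self):
        self.powered = True
        self.volatile_bit = 0      # lost when power is removed
        self.nonvolatile_bit = 0   # survives power-down

    def write(self, bit):
        """Normal operation: only the fast volatile latch is written."""
        if not self.powered:
            raise RuntimeError("cell is powered down")
        self.volatile_bit = bit

    def read(self):
        """Normal operation: data is served from the volatile latch."""
        if not self.powered:
            raise RuntimeError("cell is powered down")
        return self.volatile_bit

    def power_down(self):
        """Store: copy the latch state into the NVM element, then cut power."""
        self.nonvolatile_bit = self.volatile_bit
        self.volatile_bit = None
        self.powered = False

    def power_up(self):
        """Restore: reload the latch from the NVM element."""
        self.powered = True
        self.volatile_bit = self.nonvolatile_bit


cell = NVSRAMCell()
cell.write(1)
cell.power_down()   # latch content is gone, backup remains
cell.power_up()
print("restored bit:", cell.read())
```

The bitcells surveyed here differ mainly in how the store and restore steps are implemented in hardware (number of transistors, control lines and NVM elements), not in this overall sequence.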
Fig. 2.1 Circuit schematics of different NV-SRAM bitcells proposed in literature: a 9T-2R: WLL
and WLR are separate WLs to control 1T-1R cells, b 8T-2R: SWL indicates NVM switch line,
c 8T-1R, d 7T-2R, e 7T-1R, and f 6T-2R (redrawn from [13, 14, 17, 18, 20, 22]). Variable resistance
here indicates NVM element
Several NV-FFs have been proposed over time using emerging NVM devices such
as OxRAM [5, 27–31], MTJ [32–42], ferroelectric capacitors [43–45] and transis-
tors [6, 46]. These flip-flops provide on-demand and controlled data backup and
restore whenever the appropriate backup signal is triggered. However, the additional
off-loading circuit leads to area and power overheads. Therefore, the major
challenge in designing an NV-FF lies in arriving at an area-efficient circuit
design with high performance in terms of speed, power and energy. A lot of
2 CMOS-OxRAM Based Hybrid Nonvolatile SRAM and Flip-Flop … 33
Fig. 2.2 Different NV-FF schematics proposed in literature a OxRAM-based NV-FF [5] b STT-
MTJ-based NV-FF [34] c SHE-MTJ-based NV-FF [35] d Ferroelectric capacitors-based NV-FF
[46]. The figures have been redrawn from the referenced papers
developmental work has been done on the design and optimization of NV-FFs. Figure 2.2
shows some of the NV-FF designs proposed in literature. Iyenger et al. [3] proposed
an MTJ-based NV-FF with enhanced scan capability in two variants: Enhanced
Scan Enabled NV-FF (ES NV-FF) and High Performance ES NV-FF (HPES NV-FF).
In ES NV-FF, two parallel latches allowed enhanced scan and store-restore
operations. The output of the master latch was connected to the slave latch as well as
the NV latch. The two MTJ devices were written serially during the negative pulse of
the clock cycle, which limited the operating frequency of the FF. In HPES NV-FF, the
MTJ devices were written in parallel, so the frequency of the FF was not compromised.
The authors also reported that the cell area of ES NV-FF was ≈1.8× that of a
standard master–slave FF (MSFF), with a maximum frequency of 2 GHz. HPES
NV-FF had an area overhead of ≈2.5× that of MSFF with a 2 GHz operating range.
In [5], a bipolar OxRAM-based NV-FF was proposed. The off-loading NVM circuit
was connected to the slave part of the FF and comprised two OxRAM
devices whose operational modes were controlled by groups of transistors called
NVM-L and NVM-R. Each NVM block was a 3T-1R structure that contributed to
controlling the circuit and providing current compliance. The authors claimed that the
circuit has zero standby-leakage power and nonvolatility, at an area overhead of only
25% as compared to the Balloon FF solution [47] and a 10% increase in CLK-Q delay
compared to a normal FF delay. In [33] and [48], two MTJ devices were used for off-
loading the data from MSFF and the MTJ devices retained the off-loaded state only
during the sleep-mode. In these designs, the MTJ states were updated on every clock
cycle, which increased the power consumption and reduced both the FF speed and the
endurance of the MTJ. Furthermore, Jung et al. [48] aimed to minimize short-circuit
current by using a low-skewed NAND (LS-NAND) gate to efficiently interface the
two supply voltage levels of 1.1 and 1.8 V. In [32, 49, 50], the NV-FF was
implemented as a part of the write driver circuit. As a result, the transistor sizes in
these designs were quite large, leading to higher parasitic capacitance. This affected
the operational speed of the FF as well as its data integrity. The magnetic FF proposed
by Sakimura et al. [32] gave a maximum operating frequency of 500 MHz with 1 ns
data backup time. Endoh et al. [50] proposed a PFET-based 1T-1MTJ NV-FF with an
operating frequency of 600 MHz. Kazi et al. [51] proposed two OxRAM-based NV-FFs
exploiting sub-VT operation to enable zero-leakage sleep states. The FF operated at
2 V and had a current compliance of 10 µA. The write energy was OxRAM dependent,
while the sub-VT operation reduced the read energy by 5.4%. The restore operation
was done at 0.4 V. In recent work by Kang et al. [52], a voltage-controlled magnetic
anisotropy (VCMA) NV-FF was proposed, which exploited VCMA assistance for
faster switching of the magnetic devices used in the circuit. The authors reported
that, owing to the VCMA effect, the current density and pulse duration required
for MTJ switching can be greatly reduced. Compared to a conventional
STT-MRAM-based NV-FF, the VCMA STT-MRAM-based NV-FF showed an improvement
of 98.4% in data backup energy and 89.5% in data backup delay. While this
methodology was beneficial for STT-MRAM-based NV-FFs, the margin of improvement
in SHE-based NV-FFs was smaller (74.6% in data backup energy and 19% in data
backup delay). Bishnoi et al. [53] proposed a 2-MTJ-based NV-FF which reduced the
static power consumption by 5× compared to CMOS-based FFs. However, the proposed
design was bulky, as it required 32 transistors and 2 MTJ cells compared to the
26 transistors used in conventional CMOS-based FFs. A ferroelectric-based
nonvolatile FF for wearable health care systems was proposed by Izumi et al. in [54].
The FF stored complementary data in coupled ferroelectric capacitors, which enabled
a reduction in the capacitor size by 88%. The FF had a read voltage margin of
240 mV at 1.5 V, which resulted in a low access energy of 2.4 pJ with 10-year
(at 85 °C) data-retention capability. Ali et al. [55] also proposed an MTJ-based
NV-FF aimed at power-gating applications. The proposed design could achieve 80%
less area than a traditional STT-MRAM-based NV-FF, with a backup energy of 111 fJ
and a restore energy of 6.9 fJ. The backup and restore times achieved were 3 ns and
0.16 ns, respectively.
All the above designs off-load data only when a controlled power-down signal is
applied. They do not account for power outages caused by glitches, which lead to
loss of data because the data is not backed up during the normal phase. Some
designs use a battery backup for such cases: a sudden power loss switches the FF
to a battery mode that holds enough charge to back up the states to the NVM block.
This battery backup block requires extra area and therefore increases the overhead.
Moreover, designs without a battery backup simply assume that power glitches will
not corrupt the data. The circuit concepts used in developing NV-SRAM can be
extended to designing NV-FFs [44]. We therefore take the points mentioned above
into consideration and present a real-time data-backup-based NV-FF based on the
4T-2R NV-SRAM proposed in [24].
The 4T-2R NV-SRAM bitcell discussed in this study is shown in Fig. 2.3a [24].
Figure 2.3b shows the IV characteristics of 3 nm thick HfOx-based OxRAM devices
obtained using the compact model described in [56]. To realize nonvolatility in the
4T-2R NV-SRAM bitcell, the pull-up transistors of the SRAM bitcell are replaced by
OxRAM devices. The OxRAM devices actively participate in NV-SRAM programming
and help retain the logic state during power-down mode. The NV-SRAM bitcell has two
modes of operation: Write mode and Read mode. The OxRAM devices are programmed
only during the Write mode. True nonvolatility of the NV-SRAM bitcell is achieved
because data can be retrieved from the OxRAM devices not only after a controlled
power-down but also after an abrupt power failure.
For NV-SRAM, to encode the data in the OxRAM devices, we have proposed different
programming schemes [24, 25]. The programming schemes are classified on the basis
Fig. 2.3 a Circuit schematic of 4T-2R NV-SRAM bitcell (redrawn from [24]), b DC IV character-
istics of 3 nm thick HfOx based OxRAM device used in this study (modelled in [56])
of their approach to programming the OxRAM devices: (i) sequential programming,
in which the two OxRAM devices are programmed in two cycles, and (ii) parallel
programming, in which both OxRAM devices are programmed in a single cycle. The
working principle, advantages and trade-offs of these programming schemes are
summarized below:
Fig. 2.4 During Write ‘1’ operation: switching in a Ox1 and b Ox2 devices for LRS-HRS and
HRS1-HRS2 programming schemes [24]
Fig. 2.5 Operational modes for Read and Write operations in a two-cycle programming scheme,
and b single-cycle programming scheme using pulse engineered signals at Programming line (PL)
and Bitline (BL) [25]
as the maximum current flows through it during both Read and Write operations. As
the resistance of the OxRAM device decreases, larger pull-down transistors are
required to handle the current flowing in the circuit. This offsets the inherent
advantage of using fewer transistors in the 4T-2R NV-SRAM design. The other
disadvantages of the LRS-HRS scheme are higher power dissipation, sneak paths and
lower SNM (Static Noise Margin). To mitigate some of these issues, the more efficient
HRS1-HRS2 programming scheme can be used instead of LRS-HRS. In the HRS1-HRS2
scheme, one of the OxRAM devices is programmed using a weak SET while the other
OxRAM device is programmed to the RESET state. This lowers the switching energy/bit
and the pull-down transistor area. In this scheme, the peak amplitude of PL is kept
at 1.2 V (as 90 nm CMOS uses similar voltage ranges for its operation). For Write
logic ‘1’, data is loaded to BL and its complementary data is loaded to BLB. While
programming, the effective positive V_TB across the OxRAM device storing logic ‘1’
and the negative V_TB across the OxRAM device storing logic ‘0’ are smaller in
magnitude than the corresponding positive and negative V_TB when the OxRAM is
programmed using PL = 1.6 V. This results in different SET and RESET resistance
states (0.68 MΩ and 2.04 MΩ, respectively) in each OxRAM device using HRS1-HRS2
(see Fig. 2.4). Using HRS1-HRS2, V_TB for SET switching is 313 mV (compared to
750 mV using LRS-HRS) and for RESET switching is −780 mV (compared to 797 mV
using LRS-HRS). The NMOS transistor width and write energy are lowered to 240 nm
(640 nm in LRS-HRS) and 0.414 pJ (1.8 pJ in LRS-HRS) using the energy-efficient
HRS1-HRS2 scheme. A detailed timing diagram for the two-cycle programming scheme
is shown in Fig. 2.5a.
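The savings quoted above for HRS1-HRS2 relative to LRS-HRS can be expressed as percentage reductions. The short sketch below only redoes that arithmetic with the numbers given in the text; the helper name `percent_reduction` is introduced here for illustration and is not from the original work.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Return how much smaller `improved` is than `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Values quoted in the text for the two-cycle schemes.
width_nm = {"LRS-HRS": 640.0, "HRS1-HRS2": 240.0}   # NMOS pull-down width
energy_pj = {"LRS-HRS": 1.8, "HRS1-HRS2": 0.414}    # write energy

width_saving = percent_reduction(width_nm["LRS-HRS"], width_nm["HRS1-HRS2"])
energy_saving = percent_reduction(energy_pj["LRS-HRS"], energy_pj["HRS1-HRS2"])

print(f"pull-down width reduction: {width_saving:.1f}%")
print(f"write energy reduction:    {energy_saving:.1f}%")
```

With the quoted values this gives a width reduction of 62.5% and a write-energy reduction of about 77%.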
In this scheme, the PL and BL signals are modified such that the OxRAM devices
are programmed simultaneously in a single cycle. A triangular pulse with equal
rise and fall times is applied at the PL line, providing the amplitude and polarity
of V_TB required to switch the OxRAM devices in the NV-SRAM simultaneously.
Figure 2.5b shows the timing diagram for the single-cycle programming scheme. For
data write ‘1’, the BL line is slowly ramped to 1.2 V while the BLB line is kept at 0 V.
When the access transistors are turned on, the internal nodes Q and QB reflect the
data written at BL and BLB. This action is further supported by the cross-coupled
connection between the NMOS pull-down transistors (M1 and M2). Figure 2.6a
shows the triangular pulse applied to the PL line. Depending on the potential
difference across the device (V_TB) set by the voltages at PL and BL/BLB, the
OxRAM devices are either SET or RESET. For node QB (Fig. 2.6b), the polarity of
V_TB stays negative (with peak amplitude −1.6 V) throughout the triangular
single-cycle pulse applied at PL (as BLB = 0 V). The Ox2 device switches
from LRS → HRS, resulting in negligible current through it. As a result, QB stabilizes
at 0 V (logic ‘0’) and transistor M1 is turned off. Figure 2.6c, d shows the resistive
switching at Ox1 and Ox2. Owing to the modulation of V_TB across Ox1 and the fact
that the device starts from an initial LRS state, the device switches twice in the
first write cycle. This double switching is a one-time phenomenon, visible only
during the first write cycle unless the devices are re-initialized. Meanwhile, the
potential drop across Ox2 is negative for the entire write cycle. A similar
phenomenon is evident when writing data ‘0’ to the bitcell. Table 2.1 shows the
comparison in the resistive
Fig. 2.6 Applied PL, BL and BLB signals during Write logic ‘1’ operation for a Node Q, b Node
QB. c Ox1 switching during RESET and SET regions (inset: switching activity during the first
cycle), and d Ox2 switching during RESET region [25]
Table 2.1 Absolute programming times for programming the OxRAM devices used in 4T-2R
NV-SRAM bitcell [25]

Peak BL (V)   Programming time during          Programming time during
              NV-SRAM Write ‘0’ (ns)           NV-SRAM Write ‘1’ (ns)
              Ox1 (RESET)    Ox2 (SET)         Ox1 (SET)    Ox2 (RESET)
1.3           316            878               875          387
1.2           326            897               897          357
1.1           326            897               917          384
1             386            970               936          316
switching parameters for data write ‘0’ and ‘1’. For the proposed methodology, the
device RESETs in 357 ns and SETs in 168 ns (logic ‘1’ write).
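The polarity argument for the single-cycle scheme can be illustrated with a small waveform sketch. It is a simplification under stated assumptions, not the authors' simulation: V_TB is taken as the data-line voltage minus the PL voltage (a sign convention inferred from the statement that V_TB stays negative, peaking at −1.6 V, when BLB = 0 V), the internal node is assumed to follow BL/BLB exactly, and BL is assumed to ramp linearly over the full 1 µs pulse width. The function names are hypothetical.

```python
def triangular_pulse(t, width, peak, rise_fraction=0.5):
    """Triangular PL pulse: linear rise to `peak`, then linear fall back to 0 V."""
    t_rise = rise_fraction * width
    if t <= 0.0 or t >= width:
        return 0.0
    if t <= t_rise:
        return peak * t / t_rise
    return peak * (width - t) / (width - t_rise)


def linear_ramp(t, width, peak):
    """Data line ramped linearly from 0 V to `peak` over the pulse width (assumed shape)."""
    return peak * min(max(t / width, 0.0), 1.0)


def v_tb_trace(bl_peak, pl_peak=1.6, width=1e-6, steps=1000):
    """Sample V_TB = V(data line) - V(PL) over one write cycle (assumed convention)."""
    trace = []
    for i in range(steps + 1):
        t = width * i / steps
        v_tb = linear_ramp(t, width, bl_peak) - triangular_pulse(t, width, pl_peak)
        trace.append((t, v_tb))
    return trace


ox1 = v_tb_trace(bl_peak=1.2)  # device on the BL side, BL ramped to 1.2 V
ox2 = v_tb_trace(bl_peak=0.0)  # device on the BLB side, BLB held at 0 V

print("Ox1 V_TB range:", min(v for _, v in ox1), "to", max(v for _, v in ox1))
print("Ox2 V_TB range:", min(v for _, v in ox2), "to", max(v for _, v in ox2))
```

Under these assumptions the device on the BLB side only ever sees a negative V_TB (RESET polarity), while the device on the BL side sees a negative V_TB first and a positive V_TB later in the cycle, which matches the RESET-then-SET sequence described for Ox1. The actual switching instants depend on the device model in [56] and are not reproduced by this sketch.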
Impact of PL and BL Signals on Single-Cycle Programming: In single-cycle
operation, the amplitude, the rise and fall times, and the pulse width of
the control and data signals PL and BL/BLB are the key parameters that determine
the operability of the NV-SRAM bitcell. For the OxRAM device used in the design,
the pulse width of the write cycle is taken as 1 µs. The programmed resistance state
stored in the proposed 4T-2R NV-SRAM depends on the magnitude and
polarity of V_TB. The impact of the peak amplitude of BL (keeping PL fixed at 1.6 V)
on OxRAM device switching is shown in Fig. 2.7a, b. The potential drop across
the OxRAM devices (V_TB) is affected as the slope of the data signal BL is varied.
As the maximum amplitude (V_datamax) is increased, V_TB(Ox1) decreases, which
signifies that a weaker programming condition is applied to the device. This results
in programming of the OxRAM devices Ox1 and Ox2 to different SET and RESET
resistance states.
The programming of the OxRAM devices also depends on the peak voltage of PL, as
shown in Fig. 2.7c, d. Since the previous programming state of an OxRAM device
governs its subsequent programming conditions (specifically a RESET
state going to a subsequent SET state), the OxRAM devices switch at varying
values of V_TB (Fig. 2.7c). It can be observed that the RESET switching times remain
constant. This results in the same initial condition for Ox2 (Fig. 2.7d). From this
figure it can be observed that the SET switching time of the OxRAM increases with
the amplitude of PL. This is because more time is needed to build up the desired
V_TB across the OxRAM terminals. It is observed from Fig. 2.7 that by modulating
the slopes of the PL and BL signals, the latency of the 4T-2R NV-SRAM bitcell can
be tuned in the single-cycle programming approach.
Furthermore, by varying the rise and fall times of the applied PL signal (i.e.
using an asymmetric triangular pulse), the programming time of the NV-SRAM
bitcell can be tuned. This is because the rise and fall times determine the rate at
which the potential drop across the device develops to program the OxRAM
devices to SET/RESET states. Figure 2.6a illustrates the modulation of the
SET and RESET regions of the OxRAM device for the applied PL and BL/BLB signals.
For switching of the device from HRS → LRS, the state of the OxRAM is modulated
by varying the rise and fall times of the PL signal (Fig. 2.8a, b). Note that
the RESET operation of the device occurs during the rise time of PL (Fig. 2.6a).
Fig. 2.7 Effect on V_TB required for switching, due to change in peak amplitude of a, b BL
(1–1.3 V) keeping peak amplitude of PL = 1.6 V, and c, d peak amplitude of PL (1.5–1.8 V)
keeping peak amplitude of BL = 1 V [25]
With reduction in rise time, the slope of PL increases. The V_TB required for the
transition from LRS → HRS is reached sooner, thus reducing the switching time of
the device. Correspondingly, the device achieves a SET state faster as the fall time
of PL is increased. Figure 2.8c, d shows the variation in the resistance values and
switching times of Ox1/Ox2 with change in the rise time of the asymmetric PL signal.
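The dependence of switching time on the PL rise time follows from simple ramp geometry: on a linear edge, a given voltage is reached after a time proportional to the edge duration. The sketch below assumes a device whose other terminal sits at 0 V (as for BLB = 0 V), so that |V_TB| equals the PL voltage, and uses a hypothetical 0.8 V RESET threshold purely for illustration; the real switching voltage comes from the compact model in [56].

```python
def time_to_threshold(v_threshold, v_peak, t_edge):
    """Time for a linear edge (0 V to v_peak in t_edge) to reach v_threshold."""
    if not 0 < v_threshold <= v_peak:
        raise ValueError("threshold must lie within the ramp amplitude")
    return t_edge * v_threshold / v_peak


PL_PEAK = 1.6   # V, peak PL amplitude used in the text
V_RESET = 0.8   # V, hypothetical |V_TB| needed for RESET (illustrative only)

for t_rise_ns in (500, 400, 300, 200):
    t_sw = time_to_threshold(V_RESET, PL_PEAK, t_rise_ns)
    print(f"rise time {t_rise_ns} ns: threshold reached after {t_sw:.0f} ns")
```

Halving the rise time halves the time at which the assumed threshold is reached, which is the trend described for the LRS → HRS transition.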
An advantage of the single-cycle programming scheme over two-cycle programming
is the lower energy required during the write operation (≈80 fJ for HRS1-HRS2 as
compared to 1.8 pJ for the LRS-HRS scheme [25]). The low energy is due to the fact
that the OxRAM devices stay in the RESET region for 60% of the total programming
time, during which only a small current (∼nA) flows through the device.
Furthermore, the programming time of the single-cycle scheme is half that of the
two-cycle scheme, making the single-cycle programming scheme an energy- and
latency-efficient approach.
The approach to reading the programmed bitcell is the same for both two-cycle and
single-cycle programming. To read the cell, the bitlines are precharged to Vdd/2, which
Fig. 2.8 Change in V_TB at switching instant, for a Ox1 and b Ox2 with change in the duration
of rising edge of PL voltage pulse. Change in the c switching time and d resistance of OxRAM
devices (after successful Write operation) by modulating the duration of rising edge of the pulse
applied at PL, keeping peak amplitude of BL signal at 1 V [25]
corresponds to the state ‘a’ in Fig. 2.5. WL is then asserted and a read
voltage is applied to PL (state ‘b’). Current flows through Ox1 and Ox2 depending on
the resistance state to which each is programmed. An OxRAM device programmed
to a higher resistance value allows less current to flow through it than an
OxRAM device programmed to a low resistance state. The current through the device
charges or discharges the internal node and, in turn, pulls the BL/BLB lines up or
down. This approach is similar to the read in a conventional SRAM cell.
In this case, the sense amplifier used to differentiate the data written in the bitcell
can be a voltage control sense amplifier (VCLA). Another approach to reading the
bitcell is to use a read voltage to capture the current through the device. In that
case, we use a current controlled sense amplifier (CCSA). In this scheme, a read
voltage (V_read) is applied to the PL and a current corresponding to the resistive
state flows through the device. Since WL is asserted, the current flows through the
BL/BLB lines and is converted to voltage levels by the sense amplifier, enabling
data read from the bitcell. The advantage of such a read scheme is that no precharge
circuit is needed for the bitlines. This reduces the area overhead of the overall
NV-SRAM array.
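As a rough illustration of the current-based read, the cell currents for the two resistance states can be estimated with Ohm's law. This is only a first-order sketch: it ignores the voltage drop across the access and pull-down transistors, takes the resistance values reported for the single-cycle scheme in Table 2.3, and assumes a read voltage of 0.4 V, which the text gives for the NV-FF rather than for the NV-SRAM read.

```python
R_LRS_OHM = 264e3    # low-resistance state, single-cycle scheme (Table 2.3)
R_HRS_OHM = 2.04e6   # high-resistance state (Table 2.3)
V_READ = 0.4         # V, assumed read voltage (not specified for the NV-SRAM)


def read_current(v_read, resistance_ohm):
    """First-order read current through an OxRAM device, ignoring transistor drops."""
    return v_read / resistance_ohm


i_lrs = read_current(V_READ, R_LRS_OHM)
i_hrs = read_current(V_READ, R_HRS_OHM)
ratio = i_lrs / i_hrs

print(f"I(LRS) = {i_lrs * 1e6:.2f} uA, I(HRS) = {i_hrs * 1e6:.2f} uA")
print(f"current ratio seen by the sense amplifier: {ratio:.1f}x")
```

The ratio of the two currents equals R_HRS/R_LRS and is independent of the assumed read voltage; it is this difference that a current-mode sense amplifier would have to resolve.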
Conventionally, the stability of an SRAM bitcell is defined using SNM [58]. SNM is the
maximum value of DC noise voltage Vn that can be tolerated by the memory bitcell
without changing the logic state. For the 4T-2R NV-SRAM bitcell (at the 90 nm
technology node), the hold, read and write SNM are 0.3 V, 0.13 V and 0.42 V,
respectively. For an SRAM bitcell with cell ratio (CR) 2 and pull-up ratio (PR) 1,
the hold, read and write SNM values are 0.5 V, 0.15 V and 0.5 V, respectively [57].
Figure 2.9a–c shows the effect of Vdd scaling on hold, read and write SNM for 4T-2R
NV-SRAM. Figure 2.9d–f shows the hold, read and write SNM curves for 4T-2R
NV-SRAM with pull-down transistor width (M1 and M2) in the range 200 nm–2 µm. The
width of M3/M4 is kept constant at 180 nm. It is observed that read SNM is a strong
function of CR. For lower CR values the Read operation fails; hence, for reliable
Read operation, CR needs to be equal to or greater than 2.2. Furthermore, it is
observed that a pull-down transistor (M1 and M2) width of 200 nm (CR ≈ 1.11) is
desirable for successful Write operation; however, because the Read operation is
destructive at low CR, the bitcell needs to be designed with CR ≥ 2.2 [57].
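The cell-ratio figures above are consistent with CR being the width ratio of the pull-down transistors (M1/M2) to M3/M4, since 200 nm / 180 nm ≈ 1.11. The sketch below uses that reading, which is an inference from the quoted numbers and assumes equal channel lengths, to estimate the minimum pull-down width for CR ≥ 2.2. The function names are illustrative.

```python
W_M3_M4_NM = 180.0   # width of M3/M4, held constant in the text


def cell_ratio(w_pull_down_nm, w_access_nm=W_M3_M4_NM):
    """Cell ratio as a plain width ratio (assumes equal channel lengths)."""
    return w_pull_down_nm / w_access_nm


def min_pull_down_width(cr_min, w_access_nm=W_M3_M4_NM):
    """Smallest pull-down width (nm) that meets the required cell ratio."""
    return cr_min * w_access_nm


print(f"CR at 200 nm pull-down width: {cell_ratio(200.0):.2f}")
print(f"pull-down width for CR >= 2.2: {min_pull_down_width(2.2):.0f} nm")
```

Under this reading, a reliable Read requires a pull-down width of roughly 396 nm or more with 180 nm M3/M4 devices.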
2.3.2.2 N-curve
SNM considers only the voltage metrics of the SRAM/NV-SRAM cell
when analyzing bitcell stability. The N-curve method [59], which considers both
voltage and current metrics, gives the following stability metrics: SVNM (static
voltage noise margin), SINM (static current noise margin), WTV (write-trip voltage)
and WTI (write-trip current). The read stability criterion is defined using SVNM and
SINM. A small SVNM combined with a large SINM (or vice versa) results in a stable cell
because the Vn required to disturb the cell is large [59]. Table 2.2 summarizes the
N-curve parameters calculated for 6T SRAM and 4T-2R NV-SRAM [57]. N-curve
characteristics are plotted for different pull-down transistor widths (i.e. different
CR) of the NV-SRAM cell and different Vdd amplitudes (shown in Fig. 2.9g–i). It is
observed
Fig. 2.9 Simulated 4T-2R NV-SRAM bitcell—a, d Hold SNM b, e Read SNM and c, f Write
SNM, for different Vdd values and different pull-down transistor widths respectively. N-curves for
6T SRAM and 4T-2R NV-SRAM are shown in g and impact of h different Vdd and i pull-down
transistor widths on N-curves of 4T-2R NV-SRAM bitcell [57]
Table 2.2 N-curve parameters for 6T SRAM and 4T-2R NV-SRAM bitcell [57]

Parameters        SVNM (mV)   SINM (µA)   WTV (mV)   WTI (µA)
6T SRAM           489.9       113.1       745.8      65.8
4T-2R NV-SRAM     431.7       134.7       709.56     66.4
that with increasing Vdd and pull-down transistor size, there is improvement in SINM,
WTI and WTV, while SVNM remains almost constant.
The 4T-2R NV-SRAM bitcell offers several advantages over other NV-SRAM designs
proposed in literature: (i) real-time nonvolatility, (ii) unconventional transistor
sizing, (iii) low area footprint and (iv) low-power operation. For quantitative
comparison, Table 2.3 compares the 4T-2R NV-SRAM bitcell with other NV-SRAM
implementations proposed in literature so far.
Table 2.3 Comparison of different 4T-2R NV-SRAM bitcells and state-of-the-art 6T SRAM bitcell
[25]

Parameters        4T-2MTJ [27]     4T-2R [60]     4T-2R [24]               4T-2R [25]     6T SRAM [61, 62]
                                                  (LRS-HRS / HRS1-HRS2)
Volatility        NV               NV             NV                       NV             V
NV device*        MTJ              STI-OxRAM      OxRAM                    OxRAM          –
Tech. node**      32 nm (Sim.);    40 nm (Fab.)   90 nm (Sim.)             90 nm (Sim.)   10 nm (Fab.)
                  90 nm (Fab.)
Prog.             Two step         Two step       Two step                 One step       One step
Vdd (V)           1                2.8            1.6 / 1.2                1.6            0.6
Write time        25 ns            5 µs           2 µs / 2 µs              1 µs           0.6 ns
Pull-down         3 µm             –              640 nm / 240 nm          200 nm         70 nm
transistor size
R_LRS (Ω)         1 k              20–400 k       268 k / 0.68 M           264 k          –
R_HRS (Ω)         2 k              ≈2 M           2.04 M / 2.04 M          2.04 M         –
Switching         400 µA           100 µA         2.8 µA / 456 nA          2.7 µA         –
current
SNM (mV)          340              258            250                      300            200

*NV Nonvolatile, V Volatile
**Fab. Fabricated, Sim. Simulated
Figure 2.10 shows the schematic of the proposed real-time NV-FF. The circuit uses
four OxRAM devices to store the data in real-time as it is transferred from D to
Q. The major advantages of this NV-FF are:
• The circuit occupies a smaller area than both the conventional
CMOS-based FF and the off-loading-based NV-FF.
• The circuit offers zero leakage current during the off-state of the NV-FF.
• The circuit handles power glitches during active/normal operating mode that
may otherwise corrupt the data.
• The circuit is easy to design and replaces the PMOS transistors in the conventional
CMOS-based FF and NV-FF, making it cost-effective.
The proposed NV-FF design consists of two modules: (1) a master block and (2) a
slave block. Unlike a traditional NV-FF, which has three operating modes (active,
store and restore), the proposed NV-FF has only two operating modes: active/normal
mode (which also stores the data to the nonvolatile device) and restore mode.
Fig. 2.10 Schematic of the real-time NV-FF. It is similar to the conventional CMOS based FF in
terms of modules constituting it—Master Block and Slave Block
0 turns off the transistor M4 and therefore, the OxRAM connected to it slowly
charges the internal node capacitor C_L2 to ‘1’. The programming of the OxRAM
devices in the slave block is the same as that in the master block.
A point to note here is that no external control signal monitors or triggers
the data off-loading. This reduces the number of external connections
to the NV-FF, thus easing routing and pin/terminal congestion.
2. Power-down Mode: During the power-down mode, all the signals are pulled down
to zero and the FF goes to a standby mode. Since the nonvolatile devices store
the data as their resistive state, it is not lost and can be restored when the NV-FF
is powered up again.
3. Restore Mode: CLK = 0 and PL = 1 are asserted when the NV-FF block is
turned on. This allows a current to flow through the OxRAM devices (connected to
M3 and M4) depending on the resistive state to which they are programmed. The
OxRAM device connected to M4 (programmed to a SET state) charges the gate
of transistor M3 to logic ‘1’, turning it on. This discharges the internal
node capacitor C_L1 in the slave block to logic ‘0’, restoring the data at the output
Q of the NV-FF.
We can observe here that the data off-loading occurs in the normal mode, but only
the OxRAM devices in the slave block participate in the data restoring. In addition,
the NV-FF in this case is slower than the conventional NV-FF, since the total
time to transfer the data to the output (T_{D-Q}) is ≈2× the programming
time of the OxRAM device. The total time needed to transfer and store the data in
real-time is:
T_{D-Q} = T_{master} + T_{slave}    (2.1)
T_{D-Q} = 2T    (2.4)
Therefore, the performance of the NV-FF in this case depends heavily on the
programming time of the OxRAM device. As the technology improves, faster OxRAM
devices are being proposed. Therefore, such a design proves to be beneficial in terms
of area and performance.
Figure 2.11 shows the timing diagram of the 4T-2R-based NV-FF for real-time
data storage. The transistors used in this simulation are from the 90 nm technology
node and the OxRAM model is the same as described in [56]. The FF operates
at 1.6 V. For the device model used in the simulation, T_RESET ≥ T_SET [25]; thus,
T_store = 714 ns and T_restore = 2 ns. Since the write current of the OxRAM is small
(2.3 µA for programming the OxRAM to LRS and 364 nA for programming it to
HRS), the transistor sizes used in the latch can be kept at the minimum standard
sizing without any additional parasitics.
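The reported store time is consistent with the relation T_{D-Q} = 2T if T is taken as the slower (RESET) programming time of 357 ns; that choice of T is an assumption made here, based on the statement that T_RESET ≥ T_SET. The sketch below checks the arithmetic and derives a rough upper bound on the clock rate, assuming the clock period must cover both the master and the slave programming phases. That bound is an inference for illustration, not a figure from the source.

```python
T_PROGRAM_NS = 357.0   # assumed per-stage programming time (RESET-limited)


def d_to_q_time(t_master_ns, t_slave_ns):
    """Total D-to-Q time: master-stage plus slave-stage programming time."""
    return t_master_ns + t_slave_ns


t_store_ns = d_to_q_time(T_PROGRAM_NS, T_PROGRAM_NS)   # equals 2T
f_max_mhz = 1e3 / t_store_ns                           # 1/ns = GHz, so 1e3/ns = MHz

print(f"T_D-Q = {t_store_ns:.0f} ns")
print(f"rough clock-rate bound: {f_max_mhz:.2f} MHz")
```

With T = 357 ns this reproduces the 714 ns store time and suggests a clock rate on the order of 1.4 MHz, which illustrates why the performance of this design is tied so closely to OxRAM programming speed.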
Fig. 2.11 Timing diagram of the 4T-2R-based NV-FF. When CLK = 0, the OxRAM in master
block gets programmed and when CLK = 1, OxRAM in slave block gets programmed. R1 (feed-
forward) and R2 (feedback) are master block OxRAM, R3 (feed-forward) and R4 (feedback) are
slave block OxRAM
Due to the limitations of the NV-FF design proposed in the previous section, a
modified NV-FF design is presented here. The proposed NV-FF, shown in Fig. 2.12,
has three different modes of operation: (1) active or normal mode, (2) off-loading
or store mode and (3) restore mode. The NV-FF consists of a volatile master stage
and a single OxRAM device in the slave stage. This device stores the off-loaded
data just before power-down mode is activated. A small area overhead is required
Fig. 2.12 Schematic of the proposed modified NV-FF depicting the 3 major operating blocks
for the proposed NV-FF (6 extra transistors in addition to the 22 transistors needed
for conventional CMOS-based FFs). Only the slave latch is employed to write/read
the OxRAM device during the off-loading/restore mode, without the need for any
sensing or dedicated write driver block. The STR (store) and RSTR (restore) signals
are asserted such that only one of them is active during a store/restore operation.
Figure 2.13 shows the schematic of the off-loading block of the proposed NV-FF. The
block is essentially made up of three separate modules.
1. Nonvolatile Block: This block stores the data that is to be off-loaded from the
output node Q. When Q = 0, the OxRAM in this block is programmed to HRS
and when Q = 1, the OxRAM in this block is programmed to LRS.
2. Control Block: This block controls the operation being performed by the off-
loading section. It consists of a simple OR gate with two inputs: STR and RSTR.
Table 2.4 shows the operation performed by the off-loading block according to
the input combinations of the STR and RSTR signals. It is to be noted that STR
and RSTR can never be ‘1’ at the same time.
3. Data Generation Block: This block, along with the control block, off-loads the
data or restores the data to the output node Q when an STR or RSTR signal is
applied. The block is mainly responsible for the following two tasks:
a. Provide the write data voltage (V_WR) when data is being off-loaded.
b. Provide the read data voltage (V_RD) when the data is being read to restore the
data.
Fig. 2.13 The off-loading block showing the three sub-blocks. The data is off-loaded during
controlled power-down in a single OxRAM device. The control block is a simple OR gate
controlled by two control signals STR and RSTR. The data generation block is used to program
the device during data off-loading and to provide supply during the restore operation
For the proposed circuit V_WR is taken as 1.6 V and V_RD is taken as 0.4 V. The
read voltage has to be chosen such that the internal state of the OxRAM is not
disturbed during data read.
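The behaviour of the control block and the voltage selection described above can be captured in a small behavioural sketch. It is not a circuit model: the function `off_load_mode` and its return format are introduced here for illustration, while the OR-gate enable, the exclusion of STR = RSTR = 1, and the 1.6 V / 0.4 V levels follow the text.

```python
V_WR = 1.6   # V, write data voltage used during store
V_RD = 0.4   # V, read data voltage used during restore


def off_load_mode(str_sig: int, rstr_sig: int):
    """Return (enable, mode, supply voltage) for the off-loading block.

    `enable` models the OR-gate output of the control block. The input
    combination STR = RSTR = 1 is rejected because the text states that
    both signals can never be '1' at the same time.
    """
    if str_sig and rstr_sig:
        raise ValueError("STR and RSTR must not be asserted together")
    enable = int(bool(str_sig or rstr_sig))
    if str_sig:
        return enable, "store", V_WR
    if rstr_sig:
        return enable, "restore", V_RD
    return enable, "normal", 0.0   # OxRAM terminals grounded in normal mode


for str_sig, rstr_sig in ((0, 0), (1, 0), (0, 1)):
    print(str_sig, rstr_sig, off_load_mode(str_sig, rstr_sig))
```

The three valid input combinations map to the normal, store and restore operations, with the data generation block supplying 0 V, V_WR and V_RD respectively.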
1. Active/Normal Mode: During active or normal mode, both the STR and RSTR
signals are held at logic ‘0’ and the terminals of the OxRAM device are grounded
(V_TB = 0). Therefore the OxRAM device does not participate in the normal FF
operation. When CLK = 0, the master stage latches the input from data line D.
When CLK = 1, the feedback path in the master stage holds the last sampled
value and the data is transferred to the slave stage. The operation continues till
a power-down/sleep-mode is activated.
2. Store Mode: In this mode the control signals STR = 1 and RSTR = 0 are
asserted to off-load the data to the OxRAM device. This produces a logic ‘1’ at the
output of the control block, thereby switching on the two transistors in the
nonvolatile block and the data generation block, respectively (refer to Fig. 2.14).
Since RSTR = 0, both multiplexers in the data generation block select their first
input. Therefore, one MUX selects V_WR, which provides the data write voltage of
1.6 V to the power supply of the inverter, while the other MUX provides the data to
be off-loaded at the input of the inverter. The output of the inverter is thus
opposite to the data value being stored. This ensures that the polarity of the
voltage applied to the OxRAM is properly maintained. When Q = 0, TE is at a lower
potential and BE is at a higher potential (V_TB < 0); therefore, the OxRAM is
programmed to HRS. Similarly, when Q = 1, TE is at a higher potential than BE
(V_TB > 0) and therefore the OxRAM is programmed to LRS. After the data is written
to the OxRAM block, the power-down signal is asserted to switch off the NV-FF.
Note that the FF has to wait until the OxRAM is programmed to avoid any data
corruption during off-loading.
3. Power-Down/Sleep-Mode: In sleep-mode, all the data and the control signals
are pulled down. Since the OxRAM device stores the data as its resistive state,
the data off-loaded to it remains stored. As the system is fully switched off, the
leakage current of this block is negligible.
4. Restore Mode: During the restore mode, CLK = 0 and RSTR = 1 are asserted
thereby switching ON the transistors in the nonvolatile and the data generation
Fig. 2.14 Store operation in the proposed modified NV-FF. Green data shows the polarities and
data values at off-loading block circuit nodes when Q = 0 is to be stored in OxRAM. Blue data
shows the polarities and the values at off-loading block circuit nodes when Q = 1 is to be stored.
The red data shows the control block signals
blocks. The data block provides a read voltage (V_RD = 0.4 V) to the OxRAM
which restores the data in the slave latch. When the OxRAM is in SET state,
the internal node Qm charges to a logic ‘1’. The charging of node Qm is
supported by the inverters in the feedback circuit. Due to the presence of an inverter
between the data store/restore node Qm and the output node Q, the original data is
restored at the output of the FF. Similar steps are followed when logic ‘1’ is
restored from OxRAM. Figure 2.15 shows the restore operation in the proposed
FF circuit.
Fig. 2.15 Restore operation in the proposed modified NV-FF. Here V_RD provides the read voltage
that is needed to read the state stored in the OxRAM. The I_read current obtained from the OxRAM
charges the load capacitance at the output to a logic value ‘1’ or ‘0’ depending on the state of the
OxRAM (LRS or HRS)
Fig. 2.16 Timing diagram showing different operating modes of the modified NV-FF. The off-
loading is controlled by the control signals STR and RSTR
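The four operating modes can be summarized in a small behavioural sketch. The Python model below is not the authors' circuit or simulation: the class and method names are hypothetical, and it captures only the logical mapping stated in the text (Q = 0 is stored as HRS, Q = 1 as LRS, and the restore step returns the stored value to Q). Voltages, timing, and the internal node Qm are not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class OxRAMState(Enum):
    HRS = "high-resistance state"
    LRS = "low-resistance state"


@dataclass
class NVFlipFlopModel:
    """Hypothetical behavioural model of the NV-FF operating modes."""

    q: int = 0                           # volatile flip-flop output
    oxram: OxRAMState = OxRAMState.HRS   # nonvolatile resistive state
    powered: bool = True

    def normal(self, d: int) -> None:
        """Normal mode: the flip-flop simply captures the input bit."""
        if not self.powered:
            raise RuntimeError("flip-flop is powered down")
        self.q = d & 1

    def store(self) -> None:
        """Store mode: Q = 0 programs HRS (V_TB < 0), Q = 1 programs LRS (V_TB > 0)."""
        if not self.powered:
            raise RuntimeError("flip-flop is powered down")
        self.oxram = OxRAMState.LRS if self.q == 1 else OxRAMState.HRS

    def power_down(self) -> None:
        """Sleep mode: volatile state is lost, the OxRAM state is retained."""
        self.powered = False
        self.q = 0

    def restore(self) -> None:
        """Restore mode: the stored resistive state is read back into Q."""
        self.powered = True
        self.q = 1 if self.oxram is OxRAMState.LRS else 0
```

A typical sequence is normal(1), store(), power_down(), restore(); after the last call the model reports Q = 1 again. In the circuit, power_down() may only follow store() once programming has completed, which the model does not enforce.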
Table 2.5 Comparison of various NV-FF designs proposed in literature and the proposed NV-FF designs

Parameters                  [3]      [31]       [32]       [51]       [54]                      4T-2R NV-FF  NV-FF w/ data off-load
NV device                   MTJ      OxRAM      STT/SHE    OxRAM      Ferroelectric capacitors  OxRAM        OxRAM
Simulated/fabricated        Sim.     Fab.       Fab.       Sim.       Fab.                      Sim.         Sim.
Technology node (nm)        22       65         150        180        130                       90           90
Programming voltage (V)     1.1      1          1.5        2          1.5                       1.6          1.6
Store delay                 −        4 µs (a)   50 µs (b)  −          170 ns                    714 ns       357 ns
Restore delay               −        16 ns (a)  20 µs (b)  −          160 ns                    2 ns         2 ns
Store energy                0.57 pJ  46.2 pJ    −          735 fJ (c) 2.4 pJ                    186 pJ       93 pJ
Restore energy              58 fJ    9.2 fJ     −          735 fJ (c) 2.34 pJ                   0.4 pJ       0.4 pJ

Sim.: simulated; Fab.: fabricated
(a) Per 1000 NV-FFs
(b) For 1000 instructions per operational clock cycle of the chip
(c) For 0.8 V
S. Majumdar et al.
The timing diagram of the simulated NV-FF block is shown in Fig. 2.16 for the
different operating modes. The simulations were based on the OxRAM model
given in [56] and a 90 nm technology node. The flip-flop operates at 1.6 V. Data is
stored in a minimum of T_store = 15 ns and restored in a minimum of T_restore = 2 ns.
The NV-FF has a backup energy of 3.08 pJ and a restore energy of 0.4 pJ. Since the
write current of the OxRAM is small (2.3 µA for programming the OxRAM to LRS and
364 nA for programming it to HRS), the transistors used in the latch can be
kept at minimum standard sizing without any additional parasitics.
Table 2.5 compares the proposed NV-FFs with other NV-FF designs reported in the
literature. The NV-FFs considered in this table off-load data to devices ranging
from OxRAM and ferroelectric capacitors to MTJ-based devices. While conventional
NV-FFs rely on serial [36, 40, 41, 55, 63] or two-phase writing [38, 39, 42] during data
off-loading, the proposed NV-FF uses a single NV device that works on parallel
data writing. This drastically reduces the access time of the NV-FF and the overall
energy of the circuit.
2.5 Conclusion
In this chapter, we have presented a real-time 4T-2R NV-SRAM bitcell using HfOx-based
OxRAM devices. We have explained its operational modes (i.e., write mode and read
mode) along with multiple programming approaches (two-cycle and single-cycle
programming schemes). Since the stability of the NV-SRAM bitcell has been a
concern, we presented a detailed analysis of the impact of Vdd scaling and
transistor down-scaling on the stability metrics (SNM and N-curve). We observed
that the 4T-2R NV-SRAM permits transistor down-scaling and that its lower
switching current enables low-power circuit design. We further extended the
scope of the 4T-2R NV-SRAM bitcell by proposing a real-time NV-FF based on it. We also
discussed the shortcomings of having the OxRAM device actively participate in the
normal operation of the NV-FF and proposed a modified NV-FF design to mitigate these
issues. Although the major challenge in the design of NV-SRAM and NV-FF is
handling abrupt power glitches, active participation of the OxRAM
device slows down the overall circuit. We believe that, with advances in materials
science and engineering, this challenge will be addressed. Developmental works
such as [10, 64–69] indicate that progress in this area has started picking
up. Thus, in the days to come, OxRAM-based real-time designs will not only be area
and power efficient but will also show better performance in terms of latency and energy.
References
1. J. Abouei, J.D. Brown, K.N. Plataniotis, S. Pasupathy, Energy efficiency and reliability in
wireless biomedical implant systems. IEEE Trans. Inf. Technol. Biomed. 15(3), 456–466 (2011)
2. A.C.K. Chan, S. Okochi, K. Higuchi, T. Nakamura, H. Kitamura, J. Kimura, T. Fujita, K.
Maenaka, Low power wireless sensor node for human centered transportation system, in 2012
IEEE International Conference on Systems, Man, and Cybernetics (SMC) (IEEE, 2012), pp.
1542–1545
3. A.S. Iyengar, S. Ghosh, J.-W. Jang, MTJ-based state retentive flip-flop with enhanced-scan
capability to sustain sudden power failure. IEEE Trans. Circuits Syst. I: Regul. Pap. 62(8),
2062–2068 (2015)
4. T. Lin, K.-S. Chong, B.-H. Gwee, J.S. Chang, Fine-grained power gating for leakage and
short-circuit power reduction by using asynchronous-logic, in IEEE International Symposium
on Circuits and Systems, 2009. ISCAS 2009 (IEEE, 2009), pp. 3162–3165
5. S. Onkaraiah, M. Reyboz, F. Clermidy, J.-M. Portal, M. Bocquet, C. Muller, C. Anghel, A.
Amara et al., Bipolar reram based non-volatile flip-flops for low-power architectures, in 2012
IEEE 10th International New Circuits and Systems Conference (NEWCAS) (IEEE, 2012), pp.
417–420
6. S.K. Thirumala, A. Raha, H. Jayakumar, K. Ma, V. Narayanan, V. Raghunathan, S.K. Gupta,
Dual mode ferroelectric transistor based non-volatile flip-flops for intermittently-powered
systems, in Proceedings of the International Symposium on Low Power Electronics and Design
(ACM, 2018), p. 31
7. P.-F. Chiu, M.-F. Chang, W. Che-Wei, C.-H. Chuang, S.-S. Sheu, Y.-S. Chen, M.-J. Tsai, Low
store energy, low VDDmin, 8T2R nonvolatile latch and SRAM with vertical-stacked resistive
memory (memristor) devices for low power mobile applications. IEEE J. Solid-State Circuits
47(6), 1483–1496 (2012)
8. M. Ueki, K. Takeuchi, T. Yamamoto, A. Tanabe, N. Ikarashi, M. Saitoh, T. Nagumo, H. Sunamura,
M. Narihiro, K. Uejima et al., Low-power embedded ReRAM technology for IoT applications,
in 2015 Symposium on VLSI Technology (VLSI Technology) (IEEE, 2015), pp. T108–T109
9. I.G. Baek, C.J. Park, H. Ju, D.J. Seong, H.S. Ahn, J.H. Kim, M.K. Yang, S.H. Song, E.M. Kim,
S.O. Park et al., Realization of vertical resistive memory (VRRAM) using cost effective 3D
process, in 2011 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2011), pp.
31–38
10. S. Yu, H.-Y. Chen, B. Gao, J. Kang, H.-S.P. Wong, HfOx-based vertical resistive switching
random access memory suitable for bit-cost-effective three-dimensional cross-point architecture.
ACS Nano 7(3), 2320–2325 (2013)
11. D. Ielmini, Resistive switching memories based on metal oxides: mechanisms, reliability and
scaling. Semicond. Sci. Technol. 31(6), 063002 (2016)
12. X. Dong, N.P. Jouppi, Y. Xie, A circuit-architecture co-optimization framework for evaluating
emerging memory hierarchies, in 2013 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS) (IEEE, 2013), pp. 140–141
13. X. Xue, W. Jian, Y. Xie, Q. Dong, R. Yuan, Y. Lin, Novel RRAM programming technology
for instant-on and high-security FPGAs, in 2011 IEEE 9th International Conference on ASIC
(ASICON) (IEEE, 2011), pp. 291–294
14. P.-F. Chiu, M.-F. Chang, S.-S. Sheu, K.-F. Lin, P.-C. Chiang, C.-W. Wu, W.-P. Lin, C.-H. Lin,
C.-C. Hsu, F.T. Chen et al., A low store energy, low vddmin, nonvolatile 8T2R SRAM with 3d
stacked RRAM devices for low power mobile applications, in 2010 IEEE Symposium on VLSI
Circuits (VLSIC) (IEEE, 2010), pp. 229–230
15. Y. Zheng, P. Huang, H. Li, X. Liu, J. Kang, G. Du, Simulation of the RRAM based nonvolatile
SRAM cell, in 2014 12th IEEE International Conference on Solid-State and Integrated Circuit
Technology (ICSICT) (IEEE, 2014), pp. 1–3
16. S. Yamamoto, S. Sugahara, Nonvolatile static random access memory using magnetic tunnel
junctions with current-induced magnetization switching architecture. Jpn. J. Appl. Phys.
48(4R), 043001 (2009)
17. A.M.S. Tosson, A. Neale, M. Anis, L. Wei, 8T1R: A novel low-power high-speed RRAM-
based non-volatile SRAM design, in 2016 International Great Lakes Symposium on VLSI
(IEEE, 2016), pp. 239–244
18. S.-S. Sheu, C.-C. Kuo, M.-F. Chang, P.-L. Tseng, L. Chih-Sheng, M.-C. Wang, C.-H. Lin, W.-P.
Lin, T.-K. Chien, S.-H. Lee et al., A reram integrated 7T2R non-volatile SRAM for normally-off
computing application, in 2013 IEEE Asian Solid-State Circuits Conference (A-SSCC) (IEEE,
2013), pp. 245–248
19. M. Takata, K. Nakayama, T. Izumi, T. Shinmura, J. Akita, A. Kitagawa, Nonvolatile SRAM
based on phase change, in 2006 21st IEEE Non-Volatile Semiconductor Memory Workshop,
NVSMW (IEEE, 2006), pp. 95–96
20. W. Wei, K. Namba, J. Han, F. Lombardi, Design of a nonvolatile 7T1R SRAM cell for instant-on
operation. IEEE Trans. Nanotechnol. 13(5), 905–916 (2014)
21. A. Lee, M.-F. Chang, C.-C. Lin, C.-F. Chen, M.-S. Ho, C.-C. Kuo, P.-L. Tseng, S.-S. Sheu,
T.-K. Ku, RRAM-based 7T1R nonvolatile SRAM with 2x reduction in store energy and 94x
reduction in restore energy for frequent-off instant-on applications, in 2015 Symposium on
VLSI Technology (VLSI Technology) (IEEE, 2015), pp. C76–C77
22. W. Wang, A. Gibby, Z. Wang, T. W. Chen, S. Fujita, P. Griffin, Y. Nishi, S. Wong, Nonvolatile
SRAM cell, in 2006 International Electron Devices Meeting, December 2006 (2006), pp. 1–4
23. K. Abe, Hierarchical nonvolatile memory with perpendicular magnetic tunnel junctions for
normally-off computing, in International conference on solid state devices and materials
(SSDM 2010) (Tokyo, Japan, 2010), p. 2010
24. S. Majumdar, S.K. Kingra, M. Suri, M. Tikyani, Hybrid CMOS-OxRAM based 4T-2R NVS-
RAM with efficient programming scheme, in 2016 16th Non-Volatile Memory Technology
Symposium (NVMTS) (IEEE, 2016), pp. 1–4
25. S. Majumdar, S.K. Kingra, M. Suri, Programming scheme based optimization of hybrid 4T-2R
OXRAM NVSRAM. Semicond. Sci. Technol. 32(9), 094008 (2017)
26. T. Ohsawa, H. Koike, S. Miura, H. Honjo, K. Tokutome, S. Ikeda, T. Hanyu, H. Ohno, T.
Endoh, 1 Mb 4T-2MTJ nonvolatile STT-RAM for embedded memories using 32b fine-grained
power gating technique with 1.0 ns/200ps wake-up/power-off times, in 2012 Symposium on
VLSI Circuits (VLSIC) (IEEE, 2012), pp. 46–47
27. T. Ohsawa, H. Koike, S. Miura, H. Honjo, K. Kinoshita, S. Ikeda, T. Hanyu, H. Ohno, T. Endoh,
A 1 Mb nonvolatile embedded memory using 4T2MTJ cell with 32 b fine-grained power gating
scheme. IEEE J. Solid-State Circuits 48(6), 1511–1520 (2013)
28. W. Robinett, M. Pickett, J. Borghetti, Q. Xia, G.S. Snider, G. Medeiros-Ribeiro, A memristor-
based nonvolatile latch circuit. Nanotechnology 21(23), 235203 (2010)
29. D. Chabi, W. Zhao, E. Deng, Y. Zhang, N.B. Romdhane, J.-O. Klein, C. Chappert, Ultra low
power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms.
IEEE Trans. Circuits Syst. I: Regul. Pap. 61(6), 1755–1765 (2014)
30. I. Kazi, P. Meinerzhagen, P.-E. Gaillardon, D. Sacchetto, Y. Leblebici, A. Burg, G. De Micheli,
Energy/reliability trade-offs in low-voltage reram-based non-volatile flip-flop design. IEEE
Trans. Circuits Syst. I: Regul. Pap. 61(11), 3155–3164 (2014)
31. A. Lee, C.-P. Lo, C.-C. Lin, W.-H. Chen, K.-H. Hsu, Z. Wang, S. Fang, Z. Yuan, Q. Wei,
Y.-C. King et al., A reram-based nonvolatile flip-flop with self-write-termination scheme for
frequent-off fast-wake-up nonvolatile processors. IEEE J. Solid-State Circuits 52(8), 2194–
2207 (2017)
32. N. Sakimura, T. Sugibayashi, R. Nebashi, N. Kasai, Nonvolatile magnetic flip-flop for standby-
power-free socs. IEEE J. Solid-State Circuits 44(8), 2244–2250 (2009)
33. W. Zhao, E. Belhaire, C. Chappert, Spin-MTJ based non-volatile flip-flop, in 2007 7th IEEE
Conference on Nanotechnology (IEEE NANO) (IEEE, 2007), pp. 399–402
34. S. Yamamoto, Y. Shuto, S. Sugahara, Nonvolatile delay flip-flop using spin-transistor
architecture with spin transfer torque MTJs for power-gating systems. Electron. Lett. 47(18),
1027–1029 (2011)
35. K.-W. Kwon, S.H. Choday, Y. Kim, X. Fong, S.P. Park, K. Roy, SHE-NVFF: Spin hall effect-
based nonvolatile flip-flop for power gating architecture. IEEE Electron Device Lett. 35(4),
488–490 (2014)
36. W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, P. Mazoyer, New non-volatile logic based on
spin-MTJ. Phys. Status Solidi (A) 205(6), 1373–1377 (2008)
37. K. Ryu, J. Kim, J. Jung, J.P. Kim, S.H. Kang, S.-O. Jung, A magnetic tunnel junction based
zero standby leakage current retention flip-flop. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 20(11), 2044–2053 (2012)
38. K. Huang, Y. Lian, A low-power low-vdd nonvolatile latch using spin transfer torque MRAM.
IEEE Trans. Nanotechnol. 12(6), 1094–1103 (2013)
39. G. Prenat, K. Jabeur, G. Di Pendina, O. Boulle, G. Gaudin, Beyond STT-MRAM, spin orbit
torque ram SOT-MRAM for high speed and high reliability applications, Spintronics-Based
Computing (Springer, Berlin, 2015), pp. 145–157
40. P. Wang, X. Chen, Y. Chen, H. Li, S. Kang, X. Zhu, W. Wu, A 1.0 V 45nm nonvolatile magnetic
latch design and its robustness analysis, in 2011 IEEE Custom Integrated Circuits Conference
(CICC) (IEEE, 2011), pp. 1–4
41. Y. Jung, J. Kim, K. Ryu, J.P. Kim, S.H. Kang, S.-O. Jung, An MTJ-based non-volatile flip-flop
for high-performance SoC. Int. J. Circuit Theory Appl. 42(4), 394–406 (2014)
42. K. Jabeur, G. Di Pendina, F. Bernard-Granger, G. Prenat, Spin orbit torque non-volatile flip-flop
for high speed and low energy applications. IEEE Electron Device Lett. 35(3), 408–410 (2014)
43. Y. Wang, Y. Liu, S. Li, D. Zhang, B. Zhao, M.-F. Chiang, Y. Yan, B. Sai, H. Yang, A 3us
wake-up time nonvolatile processor based on ferroelectric flip-flops, in 2012 Proceedings of
the ESSCIRC (ESSCIRC) (IEEE, 2012), pp. 149–152
44. S. Masui, W. Yokozeki, M. Oura, T. Ninomiya, K. Mukaida, Y. Takayama, T. Teramoto, Design
and applications of ferroelectric nonvolatile SRAM and flip-flop with unlimited read/program
cycles and stable recall, in Proceedings of the IEEE 2003 Custom Integrated Circuits
Conference, 2003 (IEEE, 2003), pp. 403–406
45. M. Qazi, A. Amerasekera, A.P. Chandrakasan, A 3.4-pJ feram-enabled D flip-flop in 0.13-
um CMOS for nonvolatile processing in digital systems. IEEE J. Solid-State Circuits 49(1),
202–211 (2014)
46. D. Wang, S. George, A. Aziz, S. Datta, V. Narayanan, S.K. Gupta, Ferroelectric transistor
based non-volatile flip-flop, in Proceedings of the 2016 International Symposium on Low
Power Electronics and Design (ACM, 2016), pp. 10–15
47. S. Shigematsu, S. Mutoh, Y. Matsuya, J. Yamada, A 1-v high-speed MTCMOS circuit scheme
for power-down applications, in VLSI Circuits, 1995. Digest of Technical Papers., 1995
Symposium on (IEEE, 1995), pp. 125–126
48. Y. Jung, J. Kim, K. Ryu, S.-O. Jung, J.P. Kim, S.H. Kang, MTJ based non-volatile flip-flop in
deep submicron technology, in 2011 International SoC Design Conference (ISOCC) (IEEE,
2011), pp. 424–427
49. S. Yamamoto, Y. Shuto, S. Sugahara, Nonvolatile flip-flop using pseudo-spin-transistor
architecture and its power-gating applications, in 2012 International Semiconductor Conference
Dresden-Grenoble (ISCDG) (IEEE, 2012), pp. 17–20
50. T. Endoh, T. Ohsawa, H. Koike, T. Hanyu, H. Ohno, Restructuring of memory hierarchy in
computing system with spintronics-based technologies, in 2012 Symposium on VLSI Technology
(VLSIT) (IEEE, 2012), pp. 89–90
51. I. Kazi, P. Meinerzhagen, P.-E. Gaillardon, D. Sacchetto, A. Burg, G. De Micheli, A ReRAM-
based non-volatile flip-flop with sub-VT read and CMOS voltage-compatible write, in 2013
IEEE 11th International New Circuits and Systems Conference (NEWCAS) (IEEE, 2013), pp.
1–4
52. W. Kang, Y. Ran, W. Lv, Y. Zhang, W. Zhao, High-speed, low-power, magnetic non-volatile
flip-flop with voltage-controlled, magnetic anisotropy assistance. IEEE Magn. Lett. 7, 1–5
(2016)
53. R. Bishnoi, F. Oboril, M.B. Tahoori, Non-volatile non-shadow flip-flop using spin orbit torque
for efficient normally-off computing, in 2016 21st Asia and South Pacific Design Automation
Conference (ASP-DAC) (IEEE, 2016), pp. 769–774
54. S. Izumi, H. Kawaguchi, M. Yoshimoto, H. Kimura, T. Fuchikami, K. Marumoto, Y. Fujimori,
A ferroelectric-based non-volatile flip-flop for wearable healthcare systems, in 2015 15th Non-
Volatile Memory Technology Symposium (NVMTS) (IEEE, 2015), pp. 1–4
55. K. Ali, F. Li, S.Y.H. Lua, C.-H. Heng, Compact spin transfer torque non-volatile flip flop design
for power-gating architecture, in 2016 IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS) (IEEE, 2016), pp. 119–122
56. H. Li, Z. Jiang, P. Huang, Y. Wu, H.-Y. Chen, B. Gao, X.Y. Liu, J.F. Kang, H.-S.P. Wong,
Variation-aware, reliability-emphasized design and optimization of RRAM using spice model,
in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2015),
pp. 1425–1430
57. S.K. Kingra, S. Majumdar, M. Suri, Stability analysis of hybrid CMOS-RRAM based 4T-2R
NVSRAM, in 2017 15th IEEE International New Circuits and Systems Conference (NEWCAS)
(IEEE, 2017), pp. 125–128
58. E. Seevinck, F.J. List, J. Lohstroh, Static-noise margin analysis of MOS SRAM cells. IEEE J.
Solid-State Circuits 22(5), 748–754 (1987)
59. E. Grossar, M. Stucchi, K. Maex, W. Dehaene, Read stability and write-ability analysis of sram
cells for nanometer technologies. IEEE J. Solid-State Circuits 41(11), 2577–2588 (2006)
60. C.-F. Liao, M.-Y. Hsu, Y.-D. Chih, J. Chang, Y.-C. King, C.J. Lin, Zero static-power 4T SRAM
with self-inhibit resistive switching load by pure CMOS logic process, in 2016 IEEE
International Electron Devices Meeting (IEDM) (IEEE, 2016), pp. 16–5
61. T. Song, W. Rim, S. Park, Y. Kim, J. Jung, G. Yang, S. Baek, J. Choi, B. Kwon, Y. Lee et al.,
17.1 a 10nm FinFET 128Mb SRAM with assist adjustment system for power, performance,
and area optimization, in 2016 IEEE International Solid-State Circuits Conference (ISSCC)
(IEEE, 2016), pp. 306–307
62. M.-C. Chen, C.-H. Lin, Y.-F. Hou, Y.-J. Chen, C.-Y. Lin, F.-K. Hsueh, H.-L. Liu, C.-T. Liu,
B.-W. Wang, H.-C. Chen et al., A 10 nm Si-based bulk FinFETs 6T SRAM with multiple fin
heights technology for 25% better static noise margin, in 2013 Symposium on VLSI Technology
(VLSIT) (IEEE, 2013), pp. T218–T219
63. M.-F. Chang, C.-H. Chuang, M.-P. Chen, L.-F. Chen, H. Yamauchi, P.-F. Chiu, S.-S. Sheu,
Endurance-aware circuit designs of nonvolatile logic and nonvolatile SRAM using resistive
memory (memristor) device, in 2012 17th Asia and South Pacific Design Automation
Conference (ASP-DAC) (IEEE, 2012), pp. 329–334
64. S.-S. Sheu, M.-F. Chang, K.-F. Lin, C.-W. Wu, Y.-S. Chen, P.-F. Chiu, C.-C. Kuo, Y.-S. Yang, P.-
C. Chiang, W.-P. Lin et al., A 4Mb embedded SLC resistive-RAM macro with 7.2 ns read-write
random-access time and 160ns mlc-access capability, in 2011 IEEE International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC) (IEEE, 2011), pp. 200–202
65. S.-S. Sheu, P.-C. Chiang, W.-P. Lin, H.-Y. Lee, P.-S. Chen, Y.-S. Chen, T.-Y. Wu, F.T. Chen,
K.-L. Su, M.-J. Kao et al., A 5ns fast write multi-level non-volatile 1 k bits RRAM memory
with advance write scheme, in 2009 Symposium on VLSI Circuits (IEEE, 2009), pp. 82–83
66. M.-F. Chang, P.-F. Chiu, S.-S. Sheu, Circuit design challenges in embedded memory and
resistive RAM (RRAM) for mobile SoC and 3D-IC, in 2011 16th Asia and South Pacific
Design Automation Conference (ASP-DAC) (IEEE, 2011), pp. 197–203
67. J. Tranchant, E. Janod, L. Cario, B. Corraze, E. Souchier, J.-L. Leclercq, P. Cremillieu, P.
Moreau, M.-P. Besland, Electrical characterizations of resistive random access memory devices
based on GaV4S8 thin layers. Thin Solid Films 533, 61–65 (2013)
68. H.-Y. Chen, B. Gao, H. Li, R. Liu, P. Huang, Z. Chen, B. Chen, F. Zhang, L. Zhao, Z. Jiang,
et al., Towards high-speed, write-disturb tolerant 3d vertical RRAM arrays, in 2014 Symposium
on VLSI Technology (VLSI-Technology): Digest of Technical Papers (IEEE, 2014), pp. 1–2
69. S.-Y. Wang, C.-H. Tsai, D.-Y. Lee, C.-Y. Lin, C.-C. Lin, T.-Y. Tseng, Improved resistive
switching properties of Ti/ZrO/Pt memory devices for RRAM application. Microelectron. Eng. 88(7),
1628–1632 (2011)
Chapter 3
Phase Change Memory for Physical
Unclonable Functions
Abstract Security has become a crucial concern in hardware design due to the growing
need for protection in everyday financial transactions and exchanges of private
information. Physical unclonable functions (PUFs) utilize the inevitable manufacturing
process variations to provide a unique way to verify trusted users. Improvements
in attack methods over the years have recently moved the field of PUFs from
traditional silicon devices toward emerging nonvolatile resistive switching memories.
Owing to the intrinsic programming variability of resistive switching mechanisms,
together with the high endurance of these devices, unpredictable PUF
challenge-response pairs can be generated and reconfigured a very large number of times.
In the case of phase change memories (PCMs), cell-to-cell and cycle-to-cycle
programming variability results from the random atomic structures created by the rapid
quench from the melt during reset programming and from the stochastic distribution and
orientation of seed crystals nucleated in an amorphous plug during the set operation.
This programming variability, which comes in addition to the process variations
present in any technology, is an important advantage of PCM (and other resistive
switching memory technologies) for implementations of PUFs and other hardware
security primitives. In this chapter, we review some of the work on conventional
CMOS-based PUFs, the operation principles of PCM devices, and recent reports on
PCM-based PUFs that utilize programming variability.
Security is an inherent need of human life. The security requirements of the current
age, however, have broadened beyond the protection of tangible property.
In the modern world of the internet of things (IoT), countless physical devices, vehicles,
home appliances, medical wearables, factories, smart cities, and many other systems
are being intricately connected through sensors and software and are exchanging
data that must be secured [1]. Cisco has predicted an exponential increase in the
number of connected devices, with an estimated average of 6.58 devices per person
worldwide by 2020 [2]. Hence, any breach of security can significantly affect the
lives of large numbers of individuals through the leak of health-related, financial,
and private information, and can cause severe economic loss and loss of
confidentiality [3]. Therefore, comprehensive measures need to be taken to secure not only
cyberspace but also the myriad connected devices. In this chapter, we describe
conventional and novel methods of securing hardware devices.
In a security system, a securely stored secret key is used along with cryptographic
algorithms. The leak of a secret key means that the security of the system has been
broken [3]. The traditional mechanisms for storing keys in devices included permanently
writing the secret information to a battery-backed static random access memory
(SRAM) array or to a read-only memory (ROM), and using cryptographic operations
such as digital signatures or encryption [4]. Battery-backed SRAM is expensive
in terms of area and power because of the volatile nature of the memory
operation [4], and it suffers from limited reliability because the battery can
fail. Among the nonvolatile key storage solutions, the most common is
ROM-based, in which masks define the permanent keys during the manufacturing
stage, and the keys cannot be erased or modified after manufacturing [3].
This technique requires new masks for each new key and thus prolongs the production
time. The major disadvantage of this scheme, however, is that the secret key is always
present in the permanent nonvolatile ROM, even when the device is powered off,
which creates opportunities for invasive or physical attacks. Recent advances in physical
attack techniques on electronic chips via fault analysis tools have made it easier to
produce fake security chips, which can serve as clones and continue to communicate
in the IoT environment. The most commonly used tools for invasive attacks are
high-resolution imaging with optical microscopy or scanning electron microscopy (SEM),
and destructive tools such as a focused ion beam (FIB) or a laser cutter, which
allow precise, layer-by-layer reverse engineering. In microprobing attacks, electrical
measurements can reveal the secret key stored in the permanent memory
[3].
Other nonvolatile key storage solutions have been proposed using floating gate
technologies (flash memory), but their complex fabrication makes them impractical for
PUF applications, and they are still vulnerable to leakage or manipulation of the secret
key through microprobing attacks [3]. Powered tamper-sensing and tamper-proof
circuits must be incorporated to detect or prevent invasive attacks, respectively,
at the cost of additional area and power [4].
The radio frequency identifier (RFID) tag is one of the commonly used products
that store secret data permanently. The RFID tag also includes an antenna and
communicates with a reader. The reader, which has its own antenna, interrogates an
RFID tag with a challenge data signal and provides the energy the tag needs to
operate. The RFID tag transmits back a response signal incorporating the secret
information permanently stored in its memory. The reader later communicates with
the host computer for further processing of the communication and secret information
(Fig. 3.1). The information is hard programmed into the RFID memory chip during
manufacturing and cannot be erased or modified later. Physical attacks reveal
the secret information and give an adversary the opportunity to produce clone chips.
In addition, the RFID tag can leak information through eavesdropping, in
which an unauthorized reader listens to the communication between the legitimate
tag and reader to steal information or gain access. The attacker can also record one
part of the communication and conduct a replay attack on the receiving device at
a later time [5]. By observing how the power consumption differs between
correct and incorrect passcodes, the attacker can conduct a power analysis
attack, which is a side-channel attack that retrieves secret information. The attacker
can also pursue a man-in-the-middle attack by blocking or manipulating the signal
communication path, or carry out a denial-of-service attack by injecting noise and
interference into the network in order to take the system down [5]. Hence, secret keys
should ideally be written with unclonable schemes and in sufficiently large numbers
to avoid physical attacks.
Fig. 3.1 Basic working principle of traditional passive radio frequency identification (RFID)
system [5]. Blocks shown: host system, RFID reader, RFID tag. Link labels: data; energy, clock
Fig. 3.2 a Basic working principle of the physical unclonable function (PUF): a challenge (C)
is applied to the PUF, which returns a response (R). b–g Schematic representations of essential
features of PUF. Schematics redrawn from [7]
The relation between the challenges and the responses, i.e., the CRP behavior, should
not be easily expressible with mathematical functions, and true physical randomness may
ensure this. Hence, a PUF is not a mere mathematical function but rather a procedure
with input–output functionality. Moreover, a PUF is not just an abstract concept
but rather has to be always implemented in a physical entity [6]. PUF CRPs can
be either analog or digital bit strings. For analog CRPs, several stages of decoding
and quantization are required to generate the digital bit string CRPs. A PUF system
should be easy and economical to implement yet very hard to clone. A PUF should
also be easily measurable within reasonable time, effort, power, or area [6].
There are several essential features that describe the behavior of a PUF, each described
briefly below (Fig. 3.2b–g):
i. Reliability or reproducibility:
The responses generated by the same PUF when queried with the same challenge should
always be very similar across multiple observations (Fig. 3.2b). This feature
guarantees reproducibility of responses, and a dissimilarity in the responses generated
to satisfy this requirement if an adversary, with access to the full PUF, can predict
the upcoming responses for given challenges based on the knowledge gained
from observing a set of previous CRPs. In this case, the adversary would
have succeeded in modeling the PUF system, and the clone would have broken the
unpredictability of the PUF [6].
vi. Tamper-evident:
A physical attack on the PUF system should permanently change its functionality
or leave indelible evidence in the device so that further measurements on the device
clearly indicate tampering (Fig. 3.2g) [7].
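The reliability feature (i) is often quantified in the PUF literature by the intra-PUF Hamming distance: the fraction of response bits that flip when the same challenge is applied repeatedly to the same device. The sketch below is an illustration only and is not taken from this chapter; the function names and the example bit strings are hypothetical.

```python
from typing import Sequence


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b))


def intra_hd_percent(reference: str, repeats: Sequence[str]) -> float:
    """Average percentage of bits that differ from the reference response.

    0 % means perfectly reproducible responses; larger values mean a
    noisier (less reliable) PUF.
    """
    if not repeats:
        raise ValueError("at least one repeated measurement is required")
    total = sum(hamming_distance(reference, r) for r in repeats)
    return 100.0 * total / (len(reference) * len(repeats))


# Hypothetical 8-bit responses to the same challenge, measured three more times.
reference_response = "10110010"
repeated_responses = ["10110010", "10110011", "10100010"]
noise_percent = intra_hd_percent(reference_response, repeated_responses)
```

In this made-up example two of the 24 re-measured bits differ from the reference, so noise_percent is about 8.3 %.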
Based on the construction and operation principle, PUFs can be divided into the two
broad categories of non-electronic and electronic PUFs (Fig. 3.3).
Fig. 3.3 Proposed PUFs classified in terms of construction and working principle [6]. Top-level
branches: non-electronic and electronic. Entries named in the chart: threshold voltage PUF,
delay-based, memory-based, paper PUF, coating PUF, ring oscillator PUF, SRAM PUF, PCM PUF,
RF PUF, magnetic PUF, LC PUF, latch PUF, RRAM PUF, STT-MRAM PUF, butterfly PUF,
acoustic PUF, flip-flop PUF
Designs for the integration of such systems into a chip have later been proposed with
the sequential arrangement of the light source array, the same disordered optical
medium, and the sensor array [10] (Fig. 3.4b, c).
Besides optical scattering from randomly distributed particles or shapes, other
sources of randomness that have been proposed for non-electronic PUFs include
(Fig. 3.3) the following:
i. the unique and random fiber structure of paper for forgery prevention (scanned
for measurement) [12],
ii. the random measured lengths of lands and pits on a regular compact disk (CD)
(measured by the electrical signal generated by photodiode inside the CD reader)
[13],
iii. the random positioning of thin copper wire within a silicone rubber sealant (the
near-field scattering of electromagnetic waves in the 5–6 GHz band was measured
with an RF antenna array) [14],
iv. the unique particle pattern in magnetic media of a swipe card [15],
66 N. Noor and H. Silva
Fig. 3.4 a Optical physical one-way function (POWF) [9] and b, c later design proposals for
integrated optical PUF. Schematics redrawn from [6] and [10]. Labels in the schematic: LASER
at orientation angle θ (θ1, θ2, …), optical token (light scattering medium), speckle pattern
on screen, CCD camera, Gabor hash function, responses in digital bit string (00101101…)
iii. capacitance variation in comb-shaped sensors in the top metal layer of an
integrated circuit, where a passive dielectric spray containing random dielectric
elements is explicitly introduced (this PUF is also called the coating PUF) [19],
iv. the resonant frequency variation in identically designed LC circuits built from a
glass plate sandwiched between two metal plates, along with a serially chained
metal coil [20].
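For the LC PUF in item iv, the sensitivity of the response to fabrication variation follows from the standard resonance formula f0 = 1 / (2π√(LC)). The short sketch below evaluates it for a nominal and a slightly perturbed capacitance. The component values (1 µH, 1 pF, 1 % deviation) are hypothetical and chosen only for illustration; they are not taken from [20].

```python
import math


def resonant_frequency_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonant frequency of an ideal series LC circuit, f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Hypothetical nominal design and a device whose capacitance is 1 % larger.
nominal_hz = resonant_frequency_hz(1e-6, 1.00e-12)
shifted_hz = resonant_frequency_hz(1e-6, 1.01e-12)

# Relative frequency shift caused by the capacitance deviation.
relative_shift = (nominal_hz - shifted_hz) / nominal_hz
```

Because f0 scales as 1/√C, a 1 % capacitance deviation lowers the resonant frequency by roughly 0.5 %, which is the kind of device-specific offset such a PUF measures.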
Digital PUFs output digital bit(s) as responses. There are two major categories: the
digital delay-based PUFs, which include arbiter PUFs and ring oscillator PUFs,
and the memory-based PUFs, which are based on conventional CMOS memories or
emerging nonvolatile memories (NVM) (Fig. 3.3).
The arbiter PUF relies on the digital race condition between two symmetrically
designed paths constructed with switch blocks. The switch blocks can be made of
a pair of multiplexers and buffers with two inputs and two outputs in total. Based
on a parameter input bit (0 or 1), the input and the output pairs are connected in
either a straight or a switched fashion (Fig. 3.5a). The challenge for the arbiter PUF
is the sequence of parameter bits that are fed to the serially connected switch blocks.
Due to manufacturing variations, there is always a slight difference between the two
identically designed paths and thus one path becomes a little faster in propagating
the signal. The random small difference between the two delays is received by an
arbiter circuit, which decides which path wins the race by outputting a 0 or a 1 as the
response. The arbiter circuit is made with a latch or a flip-flop [21, 22] (Fig. 3.5a).
The differential nature of the response from the arbiter output cancels out the linear
environmental factors, such as temperature, power supply voltage, or aging effects,
that both delay lines equally experience [6]. If the delay difference between two
paths is too small, the arbiter circuit output will no longer depend on the race but
rather will be determined by random noise, resulting in metastability of the arbiter
and noisy PUF responses [6].
By concatenating numerous switch blocks together, a large bit string is created
as the challenge and hence an exponentially large number of CRPs (2^n CRPs
for n switch blocks) can be generated, despite the one-bit response [21, 22].
Due to the large number of CRPs, these PUFs are also categorized as strong PUFs
and are used for authentication applications. After being used, each CRP is marked
as “used” in the server database so it cannot be reused, thus avoiding replay attacks
[4].
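The race described above can be illustrated with a minimal sketch, assuming a simple additive delay model in which each switch block contributes a random delay difference and a switched block swaps the two paths (flipping the sign of the accumulated difference). The function names, the Gaussian mismatch, and the example challenge are illustrative assumptions, not taken from the cited works.

```python
import random

def make_arbiter_puf(n_stages, seed):
    """Draw per-stage delay mismatches for one simulated PUF instance (assumed model)."""
    rng = random.Random(seed)
    # Each stage has one delay difference for the straight setting and one
    # for the switched setting (arbitrary units).
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_stages)]

def arbiter_response(puf, challenge):
    """Return the one-bit response: which of the two paths wins the race."""
    delta = 0.0  # accumulated delay difference between the two paths
    for (d_straight, d_switched), bit in zip(puf, challenge):
        if bit == 0:
            delta = delta + d_straight    # paths continue straight
        else:
            delta = -delta + d_switched   # paths swap, so the sign flips
    return 1 if delta > 0 else 0

n = 8
puf = make_arbiter_puf(n, seed=1)
challenge = [1, 0, 1, 1, 0, 0, 1, 0]
print("response bit:", arbiter_response(puf, challenge))
print("possible challenges:", 2 ** n)
```

The sketch ignores noise and arbiter metastability, so the same challenge always returns the same bit; a real device would show occasional bit flips when the delay difference is very small.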
Due to the linear additive behavior of digital delays in the basic arbiter PUF, it
is possible to model the entire arbiter PUF system mathematically using machine
learning techniques, and accurate predictions can be made about unused CRPs after
observing a certain number of CRPs. This is called a model-building attack, and it
breaks the security of this PUF [23]. Subsequent works on arbiter PUFs were intended
to make model-building attacks difficult. XOR-arbiter PUFs [24] and feedforward
arbiter PUFs [25] are two examples of such improvements, based on introducing
nonlinearity to the delay lines. For the feedforward arbiter PUF, several input
challenge bits are taken from the outcomes of some randomly placed intermediate
arbiter circuits. However, these improved arbiter PUFs were shown to still be
vulnerable to more advanced model-building attack techniques [25, 26].
Fig. 3.5 Delay-based CMOS PUFs: a arbiter PUF and b, c ring oscillator PUF with division and
comparator compensation techniques. Schematics redrawn from [6]
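To make the model-building attack on the basic arbiter PUF concrete, the following sketch trains a plain perceptron on observed CRPs of a simulated PUF. It assumes the additive delay model, under which the response is a linear threshold function of parity features of the challenge. The simulated PUF, the perceptron, and all parameter values are illustrative assumptions and do not reproduce the attacks in the cited works.

```python
import random

def make_puf(n, rng):
    """Simulated arbiter PUF: one (straight, switched) delay difference per stage."""
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

def response(puf, challenge):
    delta = 0.0
    for (a, b), c in zip(puf, challenge):
        delta = (delta + a) if c == 0 else (-delta + b)
    return 1 if delta > 0 else 0

def parity_features(challenge):
    """Constant term plus the products of (1 - 2*c_j) over each challenge suffix."""
    feats = [1.0]
    prod = 1.0
    for c in reversed(challenge):
        prod *= 1 - 2 * c
        feats.append(prod)
    return feats

def train_perceptron(crps, epochs=30):
    w = [0.0] * len(parity_features(crps[0][0]))
    for _ in range(epochs):
        for challenge, r in crps:
            x = parity_features(challenge)
            target = 1 if r == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, x))
            if target * score <= 0:  # misclassified: move weights toward target
                w = [wi + target * xi for wi, xi in zip(w, x)]
    return w

def predict(w, challenge):
    x = parity_features(challenge)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

rng = random.Random(0)
n_stages = 16
puf = make_puf(n_stages, rng)

def random_crp():
    challenge = [rng.randint(0, 1) for _ in range(n_stages)]
    return challenge, response(puf, challenge)

train_set = [random_crp() for _ in range(1000)]   # CRPs observed by the attacker
test_set = [random_crp() for _ in range(500)]     # unseen CRPs to be predicted
weights = train_perceptron(train_set)
accuracy = sum(predict(weights, c) == r for c, r in test_set) / len(test_set)
print(f"held-out prediction accuracy: {accuracy:.2f}")
```

Here 1000 observed CRPs are a small fraction of the 2^16 possible challenges, which mirrors the point made above: the exponential challenge space does not help once the underlying model is linear.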
Ring oscillator (RO) PUFs also rely on delay deviations [23]. In an RO circuit,
the output of a digital delay line is fed back to the input to create an asynchronous
oscillating loop. Due to manufacturing variations, the delay differs randomly between
identically designed circuits, which in turn determines the resulting random
frequency of the oscillation. The frequency is measured with an edge detector and
a digital counter circuit connected at the output of the RO. A parameterizable delay
setting is used as the challenge for this PUF and the measured frequency value at
the counter output is used as the analog response for the basic RO PUF. However, as
the resulting frequency depends strongly on temperature and power supply voltage,
the PUF responses become noisier when environmental factors fluctuate.
Therefore, compensation techniques are implemented by either dividing or
subtracting the output frequency values from a pair of ring oscillators [23]
(Fig. 3.5b, c). The delay circuits used in RO PUFs are of the same type as those
used in arbiter PUF circuits, and hence similar model-building attacks are possible
[21, 22]. Moreover, an unexpectedly high correlation exists between the responses
generated from (1) the same challenge applied to different FPGAs and (2)
different challenges applied to the same FPGA [6]. In later works, only one out
of eight pairs of ROs is selected to improve the uniqueness and reproducibility
of the RO PUF. This technique is termed 1-out-of-8 masking [4].
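The idea behind pairwise compensation can be sketched as follows, under the simplifying assumption that temperature and supply voltage scale both oscillators of a pair by the same multiplicative factor. The nominal frequencies and the scaling factors are hypothetical values chosen only for illustration.

```python
def ro_frequency(base_mhz, env_factor):
    """Frequency of one RO under a common environmental scaling (assumed model)."""
    return base_mhz * env_factor

def compare_pair(f_a, f_b):
    """Comparator-style compensation: one bit telling which RO is faster."""
    return 1 if f_a > f_b else 0

def ratio_pair(f_a, f_b):
    """Division-style compensation: the common factor cancels in the ratio."""
    return f_a / f_b

base_a, base_b = 201.3, 199.8  # hypothetical nominal frequencies in MHz
for env in (0.95, 1.00, 1.05):  # e.g. hot, nominal, and cold conditions
    f_a = ro_frequency(base_a, env)
    f_b = ro_frequency(base_b, env)
    print(env, round(f_a, 1), compare_pair(f_a, f_b), round(ratio_pair(f_a, f_b), 4))
```

The raw frequency changes with the environment while the comparison bit and the ratio stay fixed. In a real circuit the two oscillators do not respond identically, which is why pairs with small frequency differences remain unreliable and why masking schemes select only well-separated pairs.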
PUFs based on conventional CMOS memory rely on the settling state of a desta-
bilized digital memory circuit (Fig. 3.6). A digital memory cell has two or more
logical states, and, in normal operation, it can be programmed to one of these stable
states and be used for information storage. However, if the memory cell is brought
to an unstable state, it may start oscillating between the possible stable states and,
after a certain time, converge to a preferred state, depending on the uncontrolled
physical mismatch caused during manufacturing [6]. This concept has been used to
implement PUFs with SRAM, latch, and flip-flop circuits.
The SRAM cell is made of two cross-coupled inverters consisting of four metal
oxide semiconductor field effect transistors (MOSFETs) along with two additional
access MOSFETs. Due to the inevitable manufacturing variations, both halves cannot
be made exactly identical, and hence each SRAM cell has a slight inclination toward
one of the two logical states (0 or 1) at power-up condition. For SRAM PUF, the
powering up is the challenge, and the one-bit settling state is the response, resulting
in a single CRP per cell [27] (Fig. 3.6a). Due to the limited number of CRPs and the
linear relation between the CRP size and the number of components, these PUFs are
also categorized as weak PUFs and are used for key generation applications [4].
Very similar concepts have also been applied as follows:
i. latch PUF, where two cross-coupled NOR gates are brought to an unstable state
by a reset signal (as the challenge) and the settling state is observed (as the
response) [28] (Fig. 3.6b),
ii. butterfly PUF, where two cross-coupled latches are brought to an unstable state
by a clear/preset function (as the challenge) and the settling state is observed
(as the response) [29] (Fig. 3.6c),
iii. flip-flop PUF, where power-up condition (challenge) results in a settling state
(response) [30].
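The settling behavior shared by the SRAM, latch, butterfly, and flip-flop PUFs can be sketched with a toy model in which each cell carries a fixed mismatch value and each power-up adds a small random noise term. The Gaussian distributions, the noise level, and the function names are assumptions made for illustration only.

```python
import random

def make_sram(n_cells, seed):
    """One fixed mismatch value per cell, set at 'manufacturing' (assumed Gaussian)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

def power_up(sram, noise_sigma, rng):
    """Each cell settles to 1 or 0 depending on mismatch plus power-up noise."""
    return [1 if m + rng.gauss(0.0, noise_sigma) > 0 else 0 for m in sram]

def fractional_hamming(a, b):
    """Fraction of bit positions in which two responses differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

chip = make_sram(256, seed=7)
rng = random.Random(42)
first = power_up(chip, 0.1, rng)
second = power_up(chip, 0.1, rng)
print("bits that differ between two power-ups:", fractional_hamming(first, second))
```

Cells with a large mismatch settle to the same value at every power-up, while cells with a mismatch close to zero are the ones that flip between reads. The number of response bits grows only linearly with the number of cells, consistent with the weak PUF classification above.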
Fig. 3.6 Memory-based CMOS PUFs: a SRAM PUF, b latch PUF, and c butterfly PUF. Schematics
redrawn from [6]
All the abovementioned PUF technologies, along with the permanent key storage
schemes, can generate unique identifiers or keys. The permanent key storage schemes,
however, require hard programming during manufacturing to generate the keys and
are vulnerable to physical cloning. For PUFs, in contrast, the uncontrollable process
variations prevent manufacturing an exact physical copy [6].
The delay-based PUFs (basic arbiter, feedforward arbiter, and the ring oscillator
PUFs) are prone to model-building attacks and thus fail the unpredictability
requirement of a PUF. Even though the number of CRPs is exponentially large for the
arbiter PUFs, the security is in peril once even a relatively small number of CRPs
have been observed by the attacker. On the other hand, the CMOS memory-based PUFs
(SRAM, latch, and butterfly PUFs) can be read exhaustively, and the entire CRP
database will then be known to the attacker. For these memory-based PUFs, the number
of CRPs is linear in the number of cells in the array, and thus it is easier for an
attacker to obtain the entire CRP database [31]. The same problem also exists for the
coating PUF and the ring oscillator PUF with comparator compensation and 1-out-of-8
masking. Hence, mathematical cloning of all these CMOS-based PUFs is possible
even though physical cloning of such technologies is impossible. Once manufactured,
the physical mismatch that determines the preferred output state for a given
challenge stays unchanged for the lifetime of these PUFs. Hence, there is no option
to refresh the CRPs for these PUF technologies [6].
To increase security against mathematical cloning, controlled PUFs (CPUFs)
have been proposed, in which the PUF is complemented with cryptographic algo-
rithms. In CPUF, the PUF is accessed only by the algorithm. A cryptographic hash
function is used to generate randomly picked challenges so that model-building
attacks can be avoided (although this method cannot thwart the model-building
attack on arbiter PUF). The random challenges are used to interrogate the PUF and
the generated responses are then fed to an error correction code (ECC) to improve
the reliability or minimize noise. The output of the ECC is then inputted to another
cryptographic hash function which breaks the link between the responses and the
actual physical details of the PUF measurements [6, 23] (Fig. 3.7a).
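The controlled PUF data path described above (hash, PUF, error correction, hash) can be sketched as follows. This is a simplified illustration under several assumptions: the raw PUF is a simulated array of mismatched cells, SHA-256 stands in for both hash functions, and a majority vote over repeated reads stands in for a real error correction code, which would normally use helper data.

```python
import hashlib
import random

N_CELLS = 64
_rng = random.Random(3)
MISMATCH = [_rng.gauss(0.0, 1.0) for _ in range(N_CELLS)]  # simulated raw PUF

def raw_puf(internal_challenge, noise_sigma, rng):
    """Read 16 cells selected by the internal challenge, with read noise."""
    idx = [b % N_CELLS for b in internal_challenge[:16]]
    return [1 if MISMATCH[i] + rng.gauss(0.0, noise_sigma) > 0 else 0 for i in idx]

def majority_vote(reads):
    """Stand-in for the ECC stage: bitwise majority over repeated reads."""
    n = len(reads)
    return [1 if 2 * sum(column) > n else 0 for column in zip(*reads)]

def controlled_puf(challenge, noise_sigma=0.05, repeats=5, seed=0):
    rng = random.Random(seed)
    internal = hashlib.sha256(challenge).digest()          # input hash picks the cells
    reads = [raw_puf(internal, noise_sigma, rng) for _ in range(repeats)]
    corrected = majority_vote(reads)                       # noise reduction
    return hashlib.sha256(bytes(corrected)).hexdigest()    # output hash hides raw bits

print(controlled_puf(b"challenge-001"))
```

Because the external party only sees the hashed output, the raw response bits and the cell selection are not directly exposed, which is the purpose of wrapping the PUF in the algorithm.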
Another way to increase the security of the system is offered by
reconfigurable PUFs (RPUFs). In an RPUF, part or all of the CRP set can be
refreshed irreversibly, and thereby a completely new PUF is created after every
refresh. The RPUFs are categorized into two types: logically and physically
reconfigurable PUFs (L-RPUFs and P-RPUFs). In an L-RPUF, the responses are interfaced
with a multiplexer, control logic, or control query algorithm for the reconfiguration
purpose (Fig. 3.7b) [32]. In a P-RPUF, in contrast, the responses are intrinsically
altered due to the physical mechanism involved in refreshing the material properties,
which is not only more efficient than an L-RPUF in terms of area but also more secure
against tampering because of the physical origin of stochasticity (Fig. 3.7c) [33, 34].
Fig. 3.7 Working principles of a controlled PUF [6], b logically reconfigurable PUF [33], and
c physically reconfigurable PUF [33]. C, R, and R’ stand for challenge, response, and reconfigured
response. C_int and R_int refer to intermediate challenge and responses
CMOS-based PUFs received significant attention and were the focus of rigorous
research efforts for many years. However, the unaddressed challenges of overcoming
mathematical clonability have recently shifted the focus in the PUF field toward novel
nanotechnologies and nanomaterials. Moreover, CMOS technology is approaching
its scaling limits and new technologies are emerging to continue to deliver perfor-
mance improvements with smaller devices [35].
For storage applications, various kinds of resistive switching memory technolo-
gies have been demonstrated with promising speed, endurance, retention time, scal-
ability, and energy efficiency. The most progress has been made in the fields of phase
change memory (PCM), resistive random access memory (RRAM), and spin-transfer
torque magnetic random access memory (STT-MRAM) technologies. These nanode-
vices offer easy fabrication with simple cell structures (Fig. 3.8) [36]. All these
technologies incorporate compact two-terminal devices relying on resistive switch-
ing. These memory devices can be reversibly programmed to various resistance
levels by suitable electrical pulses. The programmed states are easily distinguishable
as distinct states and are stable under normal operating conditions (such as room
temperature and supply voltage) assuring long data retention time.
These novel memories can potentially produce lightweight, robust, secure, and
reconfigurable PUFs and other security primitives to meet the next-generation secu-
rity challenges. An intrinsic property of these nanodevices, variability, typically a
disadvantage for memory implementation, is an important advantage for PUF appli-
cations (in addition to the existing process variation present in any technology). The
programming variability is observed on the same cell for different cycles of operation
(cycle-to-cycle variability) as well as on different cells for the same programming
conditions (cell-to-cell variability). As the randomness originates from the stochas-
tic rearrangement at the atomic scale, it is impossible to formulate or predict the
variability pattern. The cycle-to-cycle programming variability allows the recon-
figurability feature for a PUF since new CRPs are obtained after each reprogram-
Fig. 3.8 Schematics of typical cell structures for a phase change memory (PCM), b resistive
random access memory (RRAM), and c spin-transfer torque magnetic random access memory
(STT-MRAM) cells (not drawn to scale)
PCM was first introduced by Ovshinsky in the late 1960s with the Ovonic threshold
switch (OTS) phenomenon [39], which also showed promise for repeated memory
operation [40]. However, the low programming speed and the high programming
energy of the prototype devices [41] dampened interest in PCM as an electronic
memory and redirected subsequent research toward optical data storage during the
1990s and 2000s [42]. In the early 2000s, advances
in PCM materials with improved scalability, speed, and resistivity contrast led to
renewed interest in PCM. PCM was then envisioned as a “universal memory” that
could potentially replace both DRAM and NAND flash [43]. However, the high reset
current in PCM hindered scaling at a pace competitive with NAND flash, and the
writing speed and endurance could not reach DRAM standards either. Considering all the
progress and remaining limitations, PCM is now regarded as a storage class memory
(SCM), a complementary technology to bridge the latency gap between NAND flash
and DRAM (Fig. 3.9) [43, 44], together with RRAM and MRAM. PCM can serve either
as storage-type SCM, for which high density is the main requirement, or as
memory-type SCM, for which high endurance (≥10^12 cycles) and high reset and set
speeds (<50 ns) are the critical requirements. Multi-level cell (MLC) storage (e.g.,
2 bits per cell with four distinct resistance states) and further progress in scaling,
beyond the 4–6 F^2 cell with sub-10 nm feature size, can lead to further improvements
of PCM [43, 45].
PCM stores information on the phase of a chalcogenide material that can be
reversibly switched between two (or more) stable states with distinct resistance levels.
The high resistance state (HRS) in PCM is the amorphous or reset state and the low
resistance state (LRS) is the crystalline or set state (Fig. 3.10). The most used and
studied phase change material has been the Ge2Sb2Te5 compound (GST-225) due to
its crystallization speed, stability, and resistivity contrast between the amorphous and
crystalline phases. In a typical PCM cell, the GST material is sandwiched between
two contacts. The mushroom cell has been the standard cell, in which the phase
change material is placed above a nanoscale bottom contact, called the heater, since
it defines the highest current density region for Joule heating and the minimum
amorphous plug required to block conduction in the high resistance state [47, 48].
Fig. 3.9 Comparison of the latency of different memory and storage technologies. PCM, along with
RRAM and MRAM, is competing to bridge the latency gap between conventional memory and
storage devices [46]
Fig. 3.10 Schematics of the reset (amorphization), the set (crystallization), and the low-voltage
read operations in PCM [47]
For amorphization (reset), an amorphous plug is created in the phase change
material just above the heater (also called the active region) by rapid cooling after
melting at or above ~900 K (melt-quench), using a high-amplitude, short electric pulse
(typically <50 ns) that is terminated abruptly [43]. The rapid quench results in random
atomic rearrangements upon solidification and leaves the material in its amorphous
phase (Fig. 3.11b). For crystallization (set), an electric pulse with moderate amplitude
and longer duration (100 ns to 10 µs [43]) brings the active region above the
crystallization temperature (~500 K) yet below the melting temperature, for a sufficient
period to allow recrystallization (Fig. 3.11a).
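The reset and set conditions above can be summarized in a small threshold sketch. It uses the approximate temperatures and the lower end of the set-pulse duration range quoted in the text, and assumes that any pulse reaching the melting temperature is terminated abruptly. Real devices depend on the full thermal and crystallization dynamics, which this sketch does not model.

```python
T_MELT_K = 900.0     # approximate melting temperature quoted above
T_CRYST_K = 500.0    # approximate crystallization temperature quoted above
MIN_SET_NS = 100.0   # lower end of the quoted set-pulse duration range

def apply_pulse(state, peak_temp_k, duration_ns):
    """Return the cell phase after one pulse (simplified threshold model)."""
    if peak_temp_k >= T_MELT_K:
        # Melt followed by an assumed abrupt quench: reset.
        return "amorphous"
    if peak_temp_k > T_CRYST_K and duration_ns >= MIN_SET_NS:
        # Held between crystallization and melting long enough: set.
        return "crystalline"
    # Too cold or too short to change the phase (for example, a read pulse).
    return state

print(apply_pulse("crystalline", 950.0, 40.0))
print(apply_pulse("amorphous", 600.0, 500.0))
print(apply_pulse("amorphous", 350.0, 50.0))
```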
The high reset current requirement in the mushroom cell necessitates a large selector
device, blocking the path toward further scaling [43]. Later cell designs showed
improved power consumption and reliability, and currently the preferred cells are
confined cells, which improve thermal confinement and minimize the active region
that must be amorphized and crystallized [49]. PCM operation has been demonstrated
for active regions as small as ~1 nm using a carbon nanotube as the critical
contact [50], demonstrating the high scalability potential of PCM compared to
traditional memory technologies. Alternative compositions of GST or other
chalcogenide materials, as well as alternative deposition methods such as atomic layer
deposition (ALD) and chemical or physical vapor deposition (CVD or PVD), have
also shown improved and promising endurance [51], low reset currents [52], fast
crystallization, and good scalability.
Fig. 3.11 a Crystalline or set state and b amorphized or reset state in a mushroom cell
The large memory window in PCM (resistance contrast between the amorphous
and crystalline phases) enables MLC storage, by which programming to multiple,
distinct resistance levels between the full reset and set states allows for storage of
more than one bit per cell for a significant increase in memory density (Fig. 3.12a)
[53]. The intermediate states are achieved by partial reset or partial set operations
either with a single pulse, with amplitude in between that of the full reset and the
full set pulses, or gradually by applying repeated smaller amplitude reset or set
pulses [54–56]. By applying appropriate voltages to the perpendicularly arranged
word line and bit line metals, the PCM cell at the cross-point is addressed in the
crossbar architecture (Fig. 3.12b) [46, 47]. Besides device downsizing, MLC, and
crossbar architecture, high density is also achieved with the 3D stackability of PCM
(Fig. 3.12b).
Despite the tremendous progress made in PCM device research over the past few
years, PCM still suffers from cell-to-cell and cycle-to-cycle resistance variability.
The variability can occur during both the reset and set operations, even if the pulse
parameters are chosen for successful switching. This phenomenon is known as
programming or switching variability, and it can be more severe in the weak
programming regimes, i.e., in partial reset and partial set operations. An attempt to
program a cell toward the partial reset regime from the crystalline state may not
always be successful: a moderate pulse chosen for this operation may leave the cell
unchanged, at the initial crystalline state, or instead program it fully into the high
resistance amorphous state [57]. These unpredictable partial programming operations
in PCM limit memory performance but enable valuable implementations for PUF
applications.
Fig. 3.12 a The large memory window in PCM cells enables multi-level cell (MLC) operation with
highly dense mixed states between the crystalline or low resistance state (LRS, ~10^2–10^3 Ω) and
the amorphous or high resistance state (HRS, ~10^6–10^7 Ω). b Three-dimensionally integrable
crossbar architecture for high-density PCM [47]
The cycle-to-cycle stochastic nature in the reset operation originates from the
random atomic rearrangement that takes place during the rapid melt-quench process
[58]. On the other hand, the cycle-to-cycle stochastic nature in the set operation stems
from the random spatial arrangement of the seed crystals that remain or nucleate after
amorphization and from where crystallization will proceed [59, 60]. Moreover, the
initial state for either operation also varies depending on the history of the cell. For
example, the initial crystalline resistance of a cell is an important factor in
the result of the subsequent operations and compounds the overall variability. In
the case of cell-to-cell programming variability, for both reset and set operations,
process variations also add to the intrinsic cell variability. Process variations include
geometry variations (thickness, length, or width of the active regions and contacts)
and local material variations.
Due to spontaneous resistance drift, PCM cells do not remain at the programmed
resistance levels, and the drift history of the cells also adds to the variability in
subsequent programming cycles. The hexagonal close-packed (hcp) phase is the stable
phase of GST and does not experience drift, but devices typically operate between
the metastable amorphous and crystalline face-centered cubic (fcc) phases, both of
which experience resistance drift. The crystalline fcc state shows a slight upward
resistance drift within typical time scales, and over longer periods it slowly
transitions to the stable hcp phase with a decrease in resistance. The amorphous
phase, on the other hand, shows a significant steady upward resistance drift at the
beginning, which saturates after a certain time (depending on temperature), after
which the resistance decreases until complete crystallization. The resistance drift
trends accelerate with temperature, causing faster data loss at higher temperatures
[61–65] (Figs. 3.13a and 3.14b). The upward resistance drift in the amorphous phase
follows a power-law behavior, and the slope of the bilogarithmic resistance–time plot
is called the drift exponent or drift coefficient [66] (Fig. 3.13b). A higher drift
coefficient indicates a faster increase of the amorphous resistance during the upward
resistance drift. The drift coefficient depends on the temperature [65–67], read
current [66], and the programmed resistance level [58]. Drift itself is a stochastic
process in PCM (Fig. 3.14a) and has been explained as related to structural relaxation
of the amorphous material after the rapid melt-quench process [66] and to charge
trapping and detrapping from incipient nuclei during early crystallization [67, 68]
(Fig. 3.15).
Fig. 3.13 a Experimental results of resistance drift observed from GST-225 line cells at various
temperature levels. b Drift coefficient extracted from the bilogarithmic resistance versus time
plot measured from a line cell at 400 K [65, 67]
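As a worked illustration of the drift coefficient, the sketch below generates noise-free resistance values from a power law of the form R(t) = R0 (t/t0)^ν and recovers ν as the least-squares slope of log R versus log t. The values R0 = 1 MΩ, t0 = 1 s, and ν = 0.1 are hypothetical and chosen only to show the calculation.

```python
import math

def drift_resistance(r0, t, t0, nu):
    """Power-law upward drift of the amorphous resistance."""
    return r0 * (t / t0) ** nu

def fit_drift_coefficient(times, resistances):
    """Least-squares slope of log10(R) versus log10(t)."""
    xs = [math.log10(t) for t in times]
    ys = [math.log10(r) for r in resistances]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

times = [1, 10, 100, 1000, 10000]  # seconds
resistances = [drift_resistance(1e6, t, 1.0, 0.1) for t in times]
print(round(fit_drift_coefficient(times, resistances), 3))  # 0.1
```

With measured data the points scatter around the line, and, as noted above, the fitted coefficient itself varies from cell to cell and with temperature.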
PCM cells also experience read disturb, in which a read operation with higher than
normally applied current results in localized heating that can perturb the cell
state, and thermal cross-talk or program disturb, in which programming neighboring
cells raises the temperature enough to disturb the state of the cell [69]. PCM
devices also exhibit various types of noise and fluctuations in electrical current
measurements. Random telegraph noise (RTN) has been observed at intermediate states
(~600 kΩ resistance level of µ-trench cells [70]) when read with a certain voltage
and at a certain ambient temperature. The RTN has been ascribed to the same physical
mechanism causing the amorphous resistance drift, i.e., structural relaxation. The
random possibility for a defect to reside in one of two equally energetically
favorable states is reported as the possible mechanism for RTN [70]. Moreover,
current measurements at crystalline and amorphous states of PCM show random
fluctuations of current, known as 1/f behavior or flicker noise [71].
Fig. 3.14 a Experimental results of spread of drift coefficients and its dependency on temperature.
The small scatter points are the drift coefficients measured on different cells and the large scatter
points are the average of all values measured at the corresponding temperature. b Dependency of
crystallization time on temperature [65, 67]
Fig. 3.15 Schematic example of random spatial arrangement of the seed crystals nucleated inside
the amorphous plug resulting in random resistance evolution over time during the crystallization
process in PCM mushroom cell. Schematic redrawn from [60]
The programming variability, resistance drift, read disturb, thermal cross-talk,
and RTN in PCM devices cause reliability problems for memory implementations
but can be utilized for hardware security applications such as PUFs or true
random number generators (TRNGs).
3.3.3.1 Concept-Only
Among the very few PCM-based PUFs reported, the first one was a concept-only
work, by Kursawe et al. in 2009 [34], with the introduction of RPUF (Reconfigurable
PUF) for the first time. According to the authors, the read process in PCM is much
more controlled than the writing process. Thus, the randomly programmed state in an
MLC scheme can generate the random multi-bit responses and can be reconfigured
as a new PUF, with a refreshed set of CRPs, by reprogramming the cell every time.
This work described the idea only, with no simulation or experimental validation.
3.3.3.2 Simulation
The following two works on PCM-based PUFs were published by Zhang et al. in
2013 and 2014. The first one, PCKGen or PCM cryptographic key generator, was
demonstrated with simulation [72], while the second one showed detailed experi-
mental validation of slightly different approaches [33]. For this PUF, the challenge
was the addresses of a pair of PCM cells that were reconfigured, and the response
was the comparison between the cell resistances. The first paper [72] was based on
three core points:
I. Due to the natural log-normal distribution of the PCM cell resistance, a
logarithmic amplifier (LogAmp) was employed in the cell read path to reshape
the cell resistance distribution in the linear domain. This method removed the
undesired bias in the bit pattern of the cryptographic key output and maximized
the entropy.
II. An imprecisely controlled current-pulse regulator (ICCR) was used to generate
probabilistic current pulses to reconfigure the PCM cells. The ICCR incorpo-
rates a current-mode digital-to-analog converter (DAC) and a pulse shaper.
These two circuits generate m-bit and n-bit digital bit strings to control the
amplitude and the duration of the applied current pulses to the PCM cells,
respectively. A similar approach was also used in their following experimental
work [33], with voltage pulses instead of the current pulses for programming.
III. A post-processing module (PPM) was used to improve the raw response qual-
ity. The authors indicated that the raw responses might incorporate noise and
might not be truly random. The PPM helped the responses to be unpredictable
and stable over time. Error correction code (ECC) with helper data (stored in
nonvolatile PCM cells with well-maintained security) and a subsequent hash
function were used to improve the response quality.
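The pairwise comparison used as the response, together with the log-domain read of point I, can be sketched as follows. The log-normal parameters, the array size, and the fixed pairing of neighboring addresses are hypothetical choices for illustration and are not taken from [72].

```python
import math
import random

def program_cells(n_cells, rng, mu=math.log(1e5), sigma=1.0):
    """Draw log-normally distributed cell resistances (assumed parameters)."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n_cells)]

def log_amp(resistance):
    """Idealized logarithmic read: log-normal resistances become normal values."""
    return math.log10(resistance)

def response_bit(cells, addr_a, addr_b):
    """Challenge = a pair of cell addresses, response = comparison result."""
    return 1 if log_amp(cells[addr_a]) > log_amp(cells[addr_b]) else 0

rng = random.Random(11)
cells = program_cells(128, rng)
pairs = [(2 * i, 2 * i + 1) for i in range(64)]
key_bits = [response_bit(cells, a, b) for a, b in pairs]

cells = program_cells(128, rng)  # reconfiguration: reprogram every cell
new_bits = [response_bit(cells, a, b) for a, b in pairs]

print("fraction of ones before:", sum(key_bits) / len(key_bits))
print("bits changed by reconfiguration:", sum(x != y for x, y in zip(key_bits, new_bits)))
```

In this idealized sketch the logarithm does not change which cell wins a comparison, since it is monotonic. Its role in the paper, as described in point I, is to reshape the distribution seen by the read circuit so that the output bits are not biased.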
A numerical PCM model [73] was used for the simulations performed in this work,
along with simulations of the auxiliary CMOS circuits in a Cadence 90 nm design
environment. The simulated security analysis showed improved bias with the
LogAmp, a reduced error rate with ECC, and ~50% uniqueness with the hash
function [72].
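Uniqueness is commonly reported as the average fractional Hamming distance between the responses of different devices to the same challenges, with 50% as the ideal value. The sketch below computes it for three short, made-up response strings; the exact metric definition used in [72] is not given in the text, so this common definition is an assumption.

```python
def hamming_fraction(a, b):
    """Fraction of positions in which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def uniqueness_percent(responses):
    """Mean pairwise fractional Hamming distance across devices, in percent."""
    k = len(responses)
    distances = [
        hamming_fraction(responses[i], responses[j])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    return 100.0 * sum(distances) / len(distances)

# Made-up responses of three devices to the same set of challenges.
responses = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
print(uniqueness_percent(responses))  # 50.0
```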
In their experimental validation, 180-nm PCM cells were used with both single and
repeated pulsing attempts separately. The address of a PCM cell was used for the
challenge and the resulting resistance upon the reconfiguration or reprogramming
was the response for this PUF [33].
For the single pulse programming approach, the partial reset programming method
was used, which requires less programming time as compared to the partial set oper-
ation. Using staircase down (SCD) pulses, the entire PCM array was initialized to the
full set state to maintain similar initial conditions for all cells prior to the program-
ming. The PCM cells were then partially reset with single rectangular applied pulses
with variable pulse amplitudes and durations (nonuniform programming scheme in
Fig. 3.16). The authors indicated that the programmed resistance of the identical
PCM cells should only depend on the fabrication process variations and the applied
programming pulse parameters. However, this overlooks the inherent programming
variability originating from the nanoscale switching mechanism in PCM, which is
a key contributing factor for the cell-to-cell programming stochasticity, besides the
unavoidable process variations [33].
The random pulse parameters for the nonuniform programming were generated
using an imprecisely controlled regulator (ICR), as in their simulation-based work,
which was argued to be a more secure scheme against physical attacks as compared
to the conventional hash-mode and TRNG-mode methods with digital-to-analog
converters (DAC). The ICR circuit includes a programming voltage generator and a
pulse generator [33] (Fig. 3.17).
An m-bit binary input string can produce 2^m possible configuration states for the
programming voltage amplitudes and an n-bit binary input string randomly selects
the delay chain from n sets of inverter delay chains; hence, 2^n possible pulse
durations can be generated [33].
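The size of this configuration space can be illustrated with a small enumeration. The mapping from bit strings to voltages and durations below (starting values and step sizes) is purely hypothetical; only the counting of 2^m amplitudes and 2^n durations follows the text.

```python
from itertools import product

def decode_pulse(amp_bits, dur_bits, v_min=1.0, v_step=0.05,
                 t_min_ns=20.0, t_step_ns=5.0):
    """Map an m-bit amplitude code and an n-bit duration code to pulse parameters."""
    amp_index = int("".join(str(b) for b in amp_bits), 2)
    dur_index = int("".join(str(b) for b in dur_bits), 2)
    return v_min + amp_index * v_step, t_min_ns + dur_index * t_step_ns

m, n = 3, 2
configs = {
    decode_pulse(amp, dur)
    for amp in product((0, 1), repeat=m)
    for dur in product((0, 1), repeat=n)
}
print(len(configs))  # 2**m * 2**n = 32 distinct (amplitude, duration) pairs
print(decode_pulse((1, 0, 1), (1, 0)))
```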
Fig. 3.16 Using the a uniform and b nonuniform programming pulse parameters for achieving
spread of programmed resistance for PCM-based RPUF [33]
Fig. 3.17 a, b The conventional approaches of generating random configuration states with hash-
mode and TRNG-mode along with digital-to-analog converters (DAC). The configuration state S is
prone to physical attack for both cases. c Random configuration state generation using imprecisely
controlled regulator (ICR) along with DAC is reported to be more secure [33]
i. Periodic refresh: the CRPs are refreshed periodically after every duration
t_R. A clock can be used to monitor the time elapsed.
ii. Frequency-based refresh: the CRPs are refreshed after a certain number of
evaluations n_R take place. A counter can be used to monitor the number of evaluations.
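The two refresh policies listed above can be sketched as a small bookkeeping class. The class name, the method name, and the use of plain numbers for time are hypothetical; the sketch only tracks when a refresh is due and does not model the reprogramming itself.

```python
class RefreshPolicy:
    """Tracks when the CRPs of a reconfigurable PUF should be refreshed."""

    def __init__(self, t_r=None, n_r=None):
        self.t_r = t_r                # periodic refresh interval, or None
        self.n_r = n_r                # evaluations allowed per refresh, or None
        self.last_refresh_time = 0.0
        self.evaluations = 0

    def on_evaluation(self, now):
        """Record one PUF evaluation at time `now`; return True if a refresh is due."""
        self.evaluations += 1
        due = False
        if self.t_r is not None and now - self.last_refresh_time >= self.t_r:
            due = True                # periodic refresh (clock)
        if self.n_r is not None and self.evaluations >= self.n_r:
            due = True                # frequency-based refresh (counter)
        if due:
            self.last_refresh_time = now
            self.evaluations = 0
        return due

by_count = RefreshPolicy(n_r=3)
print([by_count.on_evaluation(now=float(t)) for t in range(4)])

by_time = RefreshPolicy(t_r=10.0)
print([by_time.on_evaluation(now=t) for t in (4.0, 12.0, 15.0)])
```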
In the same work, an alternative partial-reset programming strategy with gradually
increasing repeated pulses (termed staircase up, or SCU) was also used along with a
program-and-verify scheme. In this method, the cell resistance was measured between
consecutive programming pulses and compared with a target programmed resistance
(Rtarget) to determine whether additional pulses were required. Because of manufacturing
variations, as indicated by the authors, and, we add here, the programming variability
itself, the number of pulses required to reach a given Rtarget varied randomly. Moreover,
the use of a varying voltage step and varying pulse durations introduced additional
variability in the responses. The challenge for the PUF was again the cell address, and
the raw response was the number of pulses required to reach a given Rtarget. The raw
responses were post-processed to quantize the output into digital bits. The error
probability of the PUF output increased with the number of output bits used for
quantization. With the gradual reset method, the total programming time was increased
until it equaled the time between two consecutive reconfigurations in the single-pulse
programming method. Hence, only one CRP was accessible between two reconfigurations,
which significantly improved the security of repeated-pulse programming compared with
single-pulse programming [33].
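The program-and-verify loop described above can be sketched as follows. This is a toy model under stated assumptions, not the scheme of [33]: the multiplicative resistance update, all parameter values, and the bin-based quantizer are hypothetical and chosen only to show how a pulse count becomes a raw response that is then reduced to a few bits.

```python
import random


def staircase_up_pulses(r_target, rng, r_start=1e4, v_start=1.0, v_step=0.1,
                        max_pulses=64):
    """Return the number of pulses needed to reach r_target (the raw response).

    The resistance update is a toy stochastic rule, not a PCM device model.
    """
    resistance = r_start
    voltage = v_start
    for pulse in range(1, max_pulses + 1):
        # Program: each pulse raises the resistance by a random factor that
        # grows with the pulse amplitude (stand-in for cell variability).
        resistance *= 1.0 + voltage * rng.uniform(0.05, 0.5)
        # Verify: read the cell and compare with the target resistance.
        if resistance >= r_target:
            return pulse
        # Staircase up: the next pulse uses a larger amplitude.
        voltage += v_step
    return max_pulses


def quantize(pulse_count, n_bits, max_pulses=64):
    """Map a pulse count in 1..max_pulses onto one of 2**n_bits equal bins."""
    return (pulse_count << n_bits) // (max_pulses + 1)
```

For example, quantize(staircase_up_pulses(1e6, random.Random(0)), 2) yields a 2-bit response for one simulated cell. With this quantizer, a larger n_bits gives narrower bins, so a one-pulse difference between two trials is more likely to change the output, which is in line with the trend reported above.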
A recent work has discussed the dependence of programming variability on the PCM
cell design and how this can be leveraged for hardware security purposes [74]. This
paper reports that since in mushroom cells or µ-trench cells the active region (volume
that is switched between amorphous and crystalline) is adjacent to the contact, the
switching mechanism is strongly dependent on the shape of the heater–chalcogenide
interface and on the heater material used [75, 76]. These cells were observed to have
smaller programming variability and the process variations (mostly on the heater
definition) are likely to dominate over the variability associated with the switching
location within the PCM material. In contrast, in line cells, the active region is within
the phase change material, away from the contacts, and the larger variability in these
cells is due mostly to the switching variations within the PCM material [77, 78]. The
authors indicated that the variability in PCM line cells can be even larger than that
observed for oxide-based resistive RAMs which are known for large programming
variability [79]. The authors have simulated the switching from the full reset state
using a precalibrated voltage for a large number of PCM line cells. The cells fail or
succeed to switch toward the crystalline resistance state with equal probability and
this random switching was proposed for TRNG implementations.
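The bit-generation idea can be illustrated with a small simulation. This is only a sketch under assumptions: Python's pseudo-random generator stands in for the physical switching event, so the output is not truly random, and the function names and the p_switch parameter are hypothetical.

```python
import random


def switching_bits(n_cells, rng, p_switch=0.5):
    """One bit per cell: 1 if the cell switched toward the crystalline state.

    rng.random() is a software stand-in for the stochastic switching of a
    cell pulsed at the precalibrated voltage; p_switch models its success rate.
    """
    return [1 if rng.random() < p_switch else 0 for _ in range(n_cells)]


def ones_fraction(bits):
    """Fraction of cells that switched; close to 0.5 for a balanced source."""
    return sum(bits) / len(bits)
```

Sweeping p_switch away from 0.5 shows how a miscalibrated programming voltage would bias the bitstream, which is why the precalibration step matters.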
Table 3.1 Comparison of various PUF properties for different technologies. Table adapted from [6]. The PCM-based PUF properties have been added here as the last row

| PUF name | Randomness type | Challenge | Response | Tamper evident? | Unique? | Reproducible? | Physically unclonable? | Mathematically unclonable? | Unpredictable? |
|---|---|---|---|---|---|---|---|---|---|
| RFID-like protocol | Secret key is explicitly hard programmed during manufacture | Interrogation by a reader | Permanent secret key | – | Yes | Yes | No | No | No |
| Optical PUF | Explicitly introduced | Laser orientation | Gabor hash of speckle pattern | Yes | Yes | Yes | Yes | Yes | Yes |
| Coating PUF | Explicitly introduced | Sensor selection | Quantized capacitance | Yes | Yes | Yes | Yes | No (c) | Yes |
| Basic arbiter PUF | Implicit manufacturer variability | Delay line setting | Arbiter decision | – | Yes | Yes | Yes | No (d) | ! (f) |
| Feedforward arbiter PUF | Implicit manufacturer variability | Delay line setting | Arbiter decision | – | Yes | Yes | Yes | No (d) | ! (f) |
| RO PUF w/division | Implicit manufacturer variability | Delay line setting | Frequency division | – | Yes | Yes | Yes | No (d) | ! (f) |
| RO PUF w/comparator and 1-out-of-8 mask | Implicit manufacturer variability | Loop pair selection | Frequency comparison | – | Yes | Yes | Yes | No (c) | Yes |
| SRAM PUF | Implicit manufacturer variability | SRAM address | Power-up state | – | Yes | Yes | Yes | No (c) | Yes |
| Latch PUF | Implicit manufacturer variability | Latch selection | Settling state of destabilized latch | – | Yes | Yes | Yes | No (c) | Yes |
| Butterfly PUF | Implicit manufacturer variability | Cell selection | Settling state of destabilized cell | – | Yes | Yes | Yes | No (c) | Yes |
| PCM-PUF based on programming variability | Implicit manufacturer variability and programming variability | Cell selection | Programmed or not? | Yes (a) | Yes | Yes (b) | Yes | Yes (e) | Yes |

(a) Tamper evidence feature can be implemented using the read disturb property and thermal cross-talk of PCM (if read voltages are sufficiently high)
(b) For a PCM-PUF based on the variability of resistance of cells programmed to either state, the CRP can be reproduced. The CRPs based on amorphous and
crystalline fcc states can, however, become noisy at elevated temperatures and also over time due to resistance drifts (these do not affect the stable crystalline
hcp states). For the programming-variability-based PCM-PUF, the challenge is a programming pulse after an initializing crystallization procedure. It might not
be possible to obtain the same programming result on consecutive trials of the same applied programming pulse on the same PCM cell to check reproducibility
if a weak programming strategy is used to randomly pass or fail a cell to program. However, the reading operations done before the next programming cycle will
retain the state due to nonvolatility and long retention time, and hence the reproducibility of the state during read can be ensured for a certain programming
cycle. Moreover, in the programming-variability-based PCM-PUF, the CRPs can be refreshed by reprogramming the cell (periodically or after a certain number
of measurements) to prevent physical attacks, and this can be done a very large number of times, limited by cell endurance (>~10^10 cycles)
(c) By exhaustively reading out all CRPs
(d) By model-building attack
(e) PCM programming variability depends on complex stochastic switching mechanisms within the phase change material that cannot be predicted or modeled
exactly
(f) When an adversary learns more and more CRPs, it becomes increasingly easier to predict the unseen CRPs
Table 3.1, adapted from the review by Maes and Verbauwhede in 2010 [6], compares
the main PUF properties for different technologies that have been considered. We
have added the security performance properties of PCM-based PUFs as the last row.
It is interesting to note that PCM and optical PUFs are the only types considered to
be mathematically unclonable, an important feature for high-security applications.
It is also important to emphasize here that PCM-based PUFs (and other emerging
NVM-based PUFs such as RRAM-PUFs or STT-MRAM-PUFs) are reconfigurable
PUFs, which also enables higher-security implementations.
3.4 Outlook
Phase change memory provides a new platform for hardware security primitives
such as PUFs and TRNGs. These applications are enabled through several unique
properties of PCM:
– programming variability due to complex mechanisms behind amorphization and
crystallization processes of phase change materials that cannot be reproduced,
predicted, or modeled exactly;
– high endurance devices that enable very large (practically unlimited) sets of
refreshed, nonvolatile challenge-response pairs;
– read disturb and thermal cross-talk effects that can potentially be used for tamper
evidence schemes;
– resistance drifts of the amorphous and crystalline fcc states that add to the intrinsic
programming variability in following cycles.
Although extensive research efforts have made PCM the well-established mem-
ory technology it is today, most have focused on improving memory performance,
which includes minimizing variability, and only a few reports have discussed the
intentional use of variability for hardware security. Further research on PCM mate-
rials and devices focusing on hardware security applications is therefore needed to
better understand and utilize variability in these devices. Suitable electrical charac-
terization and data analysis techniques are also needed to study variability in PCM
devices, as well as in other emerging nonvolatile nanoscale devices, RRAM or STT-
MRAM, which also offer promising properties for reconfigurable hardware security
primitives.
Acknowledgements This work has been funded by the Air Force Office of Scientific Research
(AFOSR) through award FA9550-14-1-0351Z (MURI: Universal Security Theory for Evaluation
and Design of Nano-scale Devices and for Development of Innovative Security Primitives). The
authors would like to thank the members of the Nanoelectronics Laboratory at University of Con-
necticut and of the MURI team, with special thanks to Fahim Rahman and Bicky Shakya from
University of Florida and Chenglu Jin and Phuong Ha from University of Connecticut for valuable
discussions on hardware security primitives, and Professor Ali Gokirmak from University of
Connecticut for his help in device physics understanding.
References
1. S. Sicari, A. Rizzardi, L.A. Grieco, A. Coen-Porisini, Security, privacy and trust in Internet
of Things: The road ahead. Comput. Networks 76, 146–164 (2015).
https://doi.org/10.1016/J.COMNET.2014.11.008
2. D. Evans, The internet of things: how the next evolution of the internet is changing everything.
Cisco Internet Bus. Solut. Gr. 1(2011), 1–11 (2011)
3. H. Handschuh, G.-J. Schrijen, P. Tuyls, Hardware intrinsic security from physically unclon-
able functions, in Towards Hardware-Intrinsic Security, ed. by A.-R. Sadeghi, D. Naccache
(Springer, Berlin Heidelberg, Germany, 2010), pp. 39–53
4. C. Herder, M.D. Yu, F. Koushanfar, S. Devadas, Physical unclonable functions and applications:
a tutorial. Proc. IEEE 102(8), 1126–1141 (2014). https://doi.org/10.1109/JPROC.2014.2320516
5. Q. Xiao, T. Gibbons, H. Lebrun, RFID technology, security vulnerabilities, and countermea-
sures, in Supply Chain the Way to Flat Organisation, ed. by Y. Huo, F. Jia (InTech, Vienna,
Austria, 2009), p. 404
6. R. Maes, I. Verbauwhede, Physically unclonable functions: a study on the state of the art
and future research directions, in Towards Hardware-Intrinsic Security, 1st edn., ed. by A.-R.
Sadeghi, D. Naccache (Springer, Berlin Heidelberg, Germany, 2010), pp. 3–37
7. B.C. Grubel, B.T. Bosworth, M.R. Kossey, H. Sun, A.B. Cooper, M.A. Foster, A.C. Foster,
Silicon photonic physical unclonable function. Opt. Express 25(11), 12710 (2017).
https://doi.org/10.1364/OE.25.012710
8. S.N. Graybeal, P.B. Mcfate, Getting Out of the STARTing Block.
Sci. Am. 261(6), 61–67 (2017). https://doi.org/10.2307/24987511
9. R. Pappu, B. Recht, J. Taylor, N. Gershenfeld, Physical one-way functions. Science
297(5589), 2026–2030 (2002). https://doi.org/10.1126/science.1074376
10. P. Tuyls, B. Škorić, Strong authentication with physical unclonable functions, in Security,
Privacy, and Trust in Modern Data Management, ed. by M. Petković, W. Jonker (Springer,
Berlin, Heidelberg, Germany, 2007), pp. 133–148
11. U. Rührmair, C. Hilgers, S. Urban, A. Weiershäuser, E. Dinter, B. Forster, C. Jirauschek, Optical
PUFs reloaded, in Eprint. Iacr, Org (2013)
12. D.W. Bauder, An anti-counterfeiting concept for currency systems, in Sandia Natl. Labs, Albu-
querque, NM, Tech. Rep. PTK-11990 (1983)
13. G. Hammouri, A. Dana, B. Sunar, CDs have fingerprints too, in Cryptographic Hardware
and Embedded Systems-CHES 2009, ed. by C. Clavier, K. Gaj (Springer, Berlin, Heidelberg,
Germany, 2009), pp. 348–362
14. G. DeJean, D. Kirovski, RF-DNA: radio-frequency certificates of authenticity, in Cryptographic
Hardware and Embedded Systems-CHES 2007, ed. by P. Paillier, I. Verbauwhede (Springer,
Berlin, Heidelberg, Germany, 2007), pp. 346–363
15. R. Indeck, M. Muller, Method and apparatus for fingerprinting magnetic media, US Patent No.
5,365,586 (1994)
16. S. Vrijaldenhoven, Acoustical physical uncloneable functions, M.S. thesis, Department of
Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, Netherlands,
2004. Available: https://pure.tue.nl/ws/files/46971492/600055-1.pdf. Accessed 21 Mar
2019
17. K. Lofstrom, W. Daasch, D. Taylor, IC identification circuit using device mismatch, in 2000
IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.
00CH37056), 9 Feb 2000, San Francisco, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/839821. Accessed 21 Mar 2019.
https://doi.org/10.1109/ISSCC.2000.839821
18. R. Helinski, D. Acharyya, J.P. Annual, A physical unclonable function defined using power
distribution system equivalent resistance variations, in Proceedings of the 46th Annual Design
Automation Conference. ACM, Jul 26–31 2009, San Francisco, CA, USA (Online). Available:
IEEE Xplore, https://ieeexplore.ieee.org/document/5227103. Accessed 21 Mar 2019.
https://doi.org/10.1145/1629911.1630089
19. P. Tuyls, G. Schrijen, B. Škorić, Read-proof hardware from protective coatings, in International
Workshop on Cryptographic Hardware and Embedded Systems, eds. by L. Goubin M. Matsui
(Springer, Berlin, Heidelberg, Germany, 2006)
20. J. Guajardo, B. Škorić, P. Tuyls, S.S. Kumar, T. Bel, A.H.M. Blom, G.-J. Schrijen, Anti-
counterfeiting, key distribution, and key storage in an ambient world via physical unclonable
functions. Inf. Syst. Front. 11(1), 19–41 (2009). https://doi.org/10.1007/s10796-008-9142-z
21. J. Lee, D. Lim, B. Gassend, G. Suh, M. van Dijk, S. Devadas, A technique to build a secret
key in integrated circuits for identification and authentication applications, in 2004 Symposium
on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No. 04CH37525), 17–19 Jun 2004,
Honolulu, HI, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/1346548.
Accessed 21 Mar 2019. https://doi.org/10.1109/VLSIC.2004.1346548
22. D. Lim, J.W. Lee, B. Gassend, G.E. Suh, M. van Dijk, S. Devadas, Extracting secret keys from
integrated circuits. IEEE Trans. Very Large Scale Integr. Syst. 13(10), 1200–1205 (2005).
https://doi.org/10.1109/TVLSI.2005.859470
23. B. Gassend, D. Clarke, M. van Dijk, S. Devadas, Silicon physical random functions, in Pro-
ceedings of the 9th ACM conference on Computer and communications security, 18–22 Nov
2002, Washington, DC, USA (Online). Available: ACM Digital Library,
https://dl.acm.org/citation.cfm?id=586132. Accessed 21 Mar 2019. https://doi.org/10.1145/586110.586132
24. B. Gassend, D. Lim, D. Clarke, M. van Dijk, S. Devadas, Identification and authentication of
integrated circuits. Concurr. Comput. Pract. Exp. 16(11), 1077–1098 (2004).
https://doi.org/10.1002/cpe.805
25. U. Rührmair, J. Sölter, F. Sehnke, On the foundations of physical unclonable functions, in IACR
Cryptology ePrint Archive (2009), p. 277
26. M. Majzoobi, F. Koushanfar, M. Potkonjak, Testing techniques for hardware security, in
IEEE International Test Conference (ITC), 28–30 Oct 2008, no. 31.3, Santa Clara, CA, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/4700636. Accessed 21
Mar 2019. https://doi.org/10.1109/TEST.2008.4700636
27. D.E. Holcomb, W.P. Burleson, K. Fu, Initial SRAM state as a fingerprint and source of true
random numbers for RFID tags, in Proceedings of the Conference on RFID Security, Graz,
Austria, vol. 7. no. 2, p. 1 (2007)
28. Y. Su, J. Holleman, B. Otis, A 1.6 pJ/bit 96% stable chip-ID generating circuit using pro-
cess variations, in IEEE International Solid-State Circuits Conference (ISSCC), 11–15 Feb
2007, San Francisco, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/4242437. Accessed 21 Mar 2019
29. S.S. Kumar, J. Guajardo, R. Maes, G.J. Schrijen, P. Tuyls, The butterfly PUF protecting IP on
every FPGA, in 2008 IEEE International Workshop on Hardware-Oriented Security and Trust,
9 Jun 2008, Anaheim, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/4559053. Accessed 21 Mar 2019. https://doi.org/10.1109/HST.2008.4559053
30. R. Maes, P. Tuyls, I. Verbauwhede, Intrinsic PUFs from flip-flops on reconfigurable devices,
in 3rd Benelux workshop on information and system security (WISSec 2008), Eindhoven,
Netherlands, vol. 17 (2008)
48. H.P. Wong, S. Raoux, S. Kim, J. Liang, J.P. Reifenberg, B. Rajendran, M. Asheghi, K.E.
Goodson, Phase change memory. Proc. IEEE 98(12), 2201–2227 (2010)
49. D.H. Im, J.I. Lee, S.L. Cho, H.G. An, D.H. Kim, I.S. Kim, H. Park, D.H. Ahn, H. Horii, S.O.
Park, U.-I. Chung, J.T. Moon, A unified 7.5 nm dash-type confined cell for high performance
PRAM device, in 2008 IEEE International Electron Devices Meeting (IEDM), 15–17 Dec
2008, San Francisco, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/4796654. Accessed 21 Mar 2019. https://doi.org/10.1109/IEDM.2008.4796654
50. J. Liang, R.G.D. Jeyasingh, H.Y. Chen, H.S.P. Wong, A 1.4 µA reset current phase change
memory cell with integrated carbon nanotube electrodes for cross-point memory applica-
tion, in 2011 Symposium on VLSI Technology - Digest of Technical Papers, 14–16 Jun 2011,
Honolulu, HI, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/5984659.
Accessed 21 Mar 2019
51. W. Kim, M. BrightSky, T. Masuda, N. Sosa, S. Kim, R. Bruce, F. Carta, G. Fraczak, H.Y. Cheng,
A. Ray, Y. Zhu, H.L. Lung, K. Suu, C. Lam, ALD-based confined PCM with a metallic liner
toward unlimited endurance, in 2016 IEEE International Electron Devices Meeting (IEDM),
3–7 Dec 2016, San Francisco, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/7838343. Accessed 21 Mar 2019.
https://doi.org/10.1109/IEDM.2016.7838343
52. H.L. Lung, Y.H. Ho, Y. Zhu, W.C. Chien, S. Kim, W. Kim, H.Y. Cheng, A. Ray, M. Brightsky,
R. Bruce, C.W. Yeh, C. Lam, A novel low power phase change memory using inter-granular
switching, in 2016 IEEE Symposium on VLSI Technology, 14–16 Jun 2016, Honolulu, HI, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/7573405. Accessed 21
Mar 2019. https://doi.org/10.1109/VLSIT.2016.7573405
53. T. Nirschl et al., Write strategies for 2 and 4-bit multi-level phase-change memory, in 2007
IEEE International Electron Devices Meeting (IEDM), 10–12 Dec 2007, Washington, DC, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/4418973. Accessed 21
Mar 2019. https://doi.org/10.1109/IEDM.2007.4418973
54. M. Stanisavljevic, A. Athmanathan, N. Papandreou, H. Pozidis, E. Eleftheriou, Phase-change
memory: Feasibility of reliable multilevel-cell storage and retention at elevated tempera-
tures, in 2015 IEEE International Reliability Physics Symposium (IRPS), 19–23 Apr 2015
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/7112747. Accessed 21
Mar 2019. https://doi.org/10.1109/IRPS.2015.7112747
55. N. Papandreou, H. Pozidis, A. Pantazi, A. Sebastian, M. Breitwisch, C. Lam, E. Elefthe-
riou, Programming algorithms for multilevel phase-change memory, in 2011 IEEE Interna-
tional Symposium of Circuits and Systems (ISCAS), 15–18 May 2011, Rio de Janeiro, Brazil
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/5937569. Accessed 21
Mar 2019. https://doi.org/10.1109/ISCAS.2011.5937569
56. N. Noor, S. Muneer, L. Adnane, R.S. Khan, R. Ramadan, F. Dirisaglik, A. Cywar, C. Lam,
Y. Zhu, A. Gokirmak, H. Silva, Pulse-mode electrical resistance trimming of Ge2Sb2Te5
phase change memory (PCM) line cells, in 2016 International Semiconductor Device Research
Symposium (ISDRS), Bethesda, MD, USA, 7–9 Dec 2016
57. N. Noor, S. Muneer, L. Adnane, R.S. Khan, A. Gorbenko, F. Dirisaglik, A. Cywar, C. Lam, Y.
Zhu, A. Gokirmak, H. Silva, Utilizing programming variability in phase change memory cells
for security, in 2017 Mater. Res. Soc. (MRS) Fall Meeting & Exhibit, Boston, MA, USA, 26
Nov–1 Dec 2017
58. M. Boniardi, D. Ielmini, S. Lavizzari, A.L. Lacaita, A. Redaelli, A. Pirovano, Statistics of resis-
tance drift due to structural relaxation in phase-change memory arrays. IEEE Trans. Electron
Devices 57(10), 2690–2696 (2010). https://doi.org/10.1109/TED.2010.2058771
59. U. Russo, D. Ielmini, A. Redaelli, A.L. Lacaita, Intrinsic data retention in nanoscaled phase-
change memories—Part I: Monte Carlo model for crystallization and percolation. IEEE Trans.
Electron Devices 53(12), 3032–3039 (2006). https://doi.org/10.1109/TED.2006.885527
60. B. Gleixner, A. Pirovano, J. Sarkar, F. Ottogalli, E. Tortorelli, M. Tosi, R. Bez, Data reten-
tion characterization of phase-change memory arrays, in 2007 IEEE International Reliability
Physics Symposium (IRPS), 15–19 Apr 2007, Phoenix, AZ, USA (Online). Available: IEEE
Xplore, https://ieeexplore.ieee.org/document/4227689. Accessed 21 Mar 2019
61. F. Dirisaglik, G. Bakan, S. Muneer, K. Cil, L. Sullivan, Z. Jurado, J. Rarey, L. Zhang, R.
Nowak, M. Akbulut, Y. Zhu, C. Lam, H. Silva, A. Gokirmak, High temperature electrical
characterization of phase change material: Ge2Sb2Te5, in 2013 Materials Research Society
(MRS) Fall Meeting & Exhibit, Boston, MA, USA, 1–6 Dec 2013
62. F. Dirisaglik, K. Cil, M. Wennberg, A. King, M. Akbulut, Y. Zhu, C. Lam, A. Gokirmak, H.
Silva, Crystalization times of Ge2Sb2Te5 nanostructures as a function of temperature, in 2012
American Physical Society (APS) March Meeting, Boston, MA, USA, 27 Feb–2 Mar 2012
63. N. Noor, K. Cil, L. Sullivan, S. Muneer, F. Dirisaglik, A. Cywar, C. Lam, Y. Zhu, A. Gokir-
mak, H. Silva, An experimental study on waveform engineering for Ge2Sb2Te5 phase change
memory cells, in 2015 Materials Research Society (MRS) Fall Meeting & Exhibit, Boston, MA,
USA, 29 Nov–4 Dec 2015
64. N. Noor, R.S. Khan, S. Muneer, L. Adnane, R. Ramadan, F. Dirisaglik, A. Cywar, C. Lam, Y.
Zhu, A. Gokirmak, H. Silva, Short and long time resistance drift measurement in intermediate
states of Ge2Sb2Te5 phase change memory line cells, in 2017 Material Research Society (MRS)
Spring Meeting & Exhibit, Phoenix, AZ, USA, 17–21 Apr 2017
65. F. Dirisaglik, G. Bakan, Z. Jurado, S. Muneer, M. Akbulut, J. Rarey, L. Sullivan, M. Wennberg,
A. King, L. Zhang, R. Nowak, C. Lam, H. Silva, A. Gokirmak, High speed, high temper-
ature electrical characterization of phase change materials: metastable phases, crystallization
dynamics, and resistance drift. Nanoscale 7(40), 16625–16630 (2015).
https://doi.org/10.1039/C5NR05512A
66. D. Ielmini, D. Sharma, S. Lavizzari, A.L. Lacaita, Physical mechanism and temperature accel-
eration of relaxation effects in phase-change memory cells, in 2008 IEEE International Relia-
bility Physics Symposium (IRPS), 27 Apr–1 May 2008, Phoenix, AZ, USA (Online). Available:
IEEE Xplore, https://ieeexplore.ieee.org/document/4558952. Accessed 21 Mar 2019.
https://doi.org/10.1109/RELPHY.2008.4558952
67. F. Dirisaglik, High-temperature electrical characterization of Ge2Sb2Te5 phase change mem-
ory devices. Ph.D. dissertation, Department of Electrical & Computer Engineering, University
of Connecticut, Storrs, CT, USA, 2014. http://digitalcommons.uconn.edu/dissertations/577/.
Accessed 21 Mar 2019
68. R.S. Khan, N. Noor, C. Jin, J. Scoggin, Z. Woods, S. Muneer, A. Ciardullo, P.H. Nguyen,
A. Gokirmak, M. van Dijk, H. Silva, Phase change memory and its applications in hardware
security, in Security Opportunities in Nano Devices and Emerging Technologies, 1st ed.,
M. Tehranipoor, D. Forte, G.S. Rose, S. Bhunia (CRC Press, Boca Raton, FL, USA, 2017),
pp. 93–114
69. A. Pirovano, A. Redaelli, F. Pellizzer, F. Ottogalli, M. Tosi, D. Ielmini, A.L. Lacaita, R. Bez,
Reliability study of phase-change nonvolatile memories. IEEE Trans. Device Mater. Reliab.
4(3), 422–427 (2004). https://doi.org/10.1109/TDMR.2004.836724
70. D. Fugazza, D. Ielmini, S. Lavizzari, A.L. Lacaita, Random telegraph signal noise in phase
change memory devices, in 2010 IEEE International Reliability Physics Symposium (IRPS),
2–6 May 2010, Anaheim, CA, USA (Online). Available: IEEE Xplore,
https://ieeexplore.ieee.org/document/5488741. Accessed 21 Mar 2019. https://doi.org/10.1109/IRPS.2010.5488741
71. G. Betti Beneventi, A. Calderoni, P. Fantini, L. Larcher, P. Pavan, Analytical model for low-
frequency noise in amorphous chalcogenide-based phase-change memory devices. J. Appl.
Phys. 106(5), 1–8 (2009). https://doi.org/10.1063/1.3160332
72. L. Zhang, Z.H. Kong, C.H. Chang, PCKGen: a phase change memory based cryptographic key
generator, in 2013 IEEE International Symposium on Circuits and Systems (ISCAS), 19–23 May
2013, Beijing, China (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/6572128.
Accessed 21 Mar 2019. https://doi.org/10.1109/ISCAS.2013.6572128
73. D.-H. Kang, D.-H. Ahn, K.-B. Kim, J.F. Webb, K.-W. Yi, One-dimensional heat conduction
model for an electrical phase change random access memory device with an 8F2 memory cell
(F=0.15 µm). J. Appl. Phys. 94(5), 3536–3542 (2003). https://doi.org/10.1063/1.1598272
Abstract With the widespread diffusion of ubiquitous mobile computing and inter-
net of things (IoT), secured communication and chip authentication become key
requirements. Hardware-based security concepts generally provide the best perfor-
mance in terms of good security standard, low power consumption, and large area
density. In these concepts, the stochastic properties of the device, such as the phys-
ical and geometrical variations of the process, are harnessed to generate random
bits and functions. This is the basis for the true-random number generator (TRNG),
where true-random numbers are generated by exploiting the physics and randomness
of nanoscale devices. The same random variations can also be used to implement
physical unclonable function (PUF) for the authentication of individual hardware
chips. Emerging memory devices rely on unique physical mechanisms for transport
and switching, thus appear as the ideal source of entropy for hardware TRNG and
PUF. These novel memory concepts include resistive switching memory (RRAM),
phase change memory (PCM), and spin-transfer torque magnetic memory (STT-
MRAM) devices. As these devices are increasingly adopted as memory and comput-
ing elements in several applications, exploiting their intrinsic stochastic variations
for TRNG and PUF becomes an attractive solution for low-cost, high-performance
security primitives. This chapter provides an overview of TRNG and PUF adopting
emerging memory devices as the fundamental entropy source. TRNG concepts are
classified by the microscopic stochastic variation that is adopted as entropy source,
namely, current noise, switching delay time, or switching voltage. While most TRNG
concepts rely on RRAM devices, we also show novel concepts using STT-MRAM
devices which take advantage of the excellent endurance and speed of switching.
The TRNG schemes are discussed in terms of the simplicity of the design, e.g., the
ability to generate random bits without a probability tracking by adopting a differ-
ential circuit scheme. Finally, the status of PUF implementations using RRAM and
their array circuits are presented and discussed.
4.1 Introduction
Information security has been a topic of intense research since the mid-1970s,
when the main purpose was to guarantee the confidentiality and integrity of data
within mainframe computers [1]. In more recent times, as mobile computers, inter-
net of things (IoT), and cloud computing are becoming ubiquitous, there is an ever-
increasing need for secure communication among them [2]. Portable devices such as
smartphones and tablets can now enable financial transactions and act as the primary
authentication token for the user. Therefore, there is a need for electronic chips to
(1) securely authenticate and be authenticated by other parties, (2) securely handle
private/sensitive information, and (3) operate in an untrusted environment where the
adversary might have physical access to the system [3]. These tasks must be imple-
mented in mobile devices at the level of integrated circuits (IC), featuring at the
same time both low power consumption and small area occupation. For application
in large-scale IoT [4] and cyber-physical systems (CPS) [5], security methodologies
must also feature high speed, low cost, and robustness to physical and side-channel
attacks [6].
Hardware-intrinsic security primitives such as true-random number generators
(TRNG) and physical unclonable functions (PUF) are gaining interest as low-cost,
high-performance security tools [7]. On the one hand, TRNG can conveniently and
efficiently generate the random bitstreams required by most cryptographic
and security applications [8, 9]. On the other hand, PUF can securely store
a secret key in the random characteristics of an IC, by, e.g., exploiting the random
process fluctuations, and enabling fast and low-cost authentication and secure key
storage [3]. Nanodevices are currently considered as the most promising approach
for TRNG and PUF thanks to the small area, the low power consumption, the scal-
ability, the 3D integration, and the ability to offer intrinsic stochastic phenomena
via the inherent physical transport and switching mechanisms. These properties are
all extremely beneficial for portable and IoT applications. Nanoelectronics can pro-
vide scalable device concepts via either the well-established complementary metal
oxide semiconductor (CMOS) technology, or via alternative memory concepts based
on resistive, phase change, magnetic and ferroelectric materials [10], sometimes
referred to as the “memristive” concepts [11]. CMOS-based TRNG [8] and PUF
[12] were first introduced thanks to the strong integration capabilities and techno-
logical maturity. Nevertheless, they soon demonstrated a limited entropy quality and
the need for increased area and power overhead to improve randomness [13]. On
the other hand, memristive devices are currently gaining increasing interest for hardware
security thanks to their intrinsic stochastic behavior, which can be harnessed for
high-performance, low-cost, low-energy on-chip entropy sources.
This chapter provides an overview of the current state-of-the-art for both TRNG
and PUF implementations with resistive (memristive) switching devices. The focus
is on the applications of emerging memory technologies such as resistive switching
memory (RRAM) and spin-transfer torque magnetic memory (STT-MRAM), com-
bining binary stochastic switching, good endurance and scalable device area. The
4 Applications of Resistive Switching Memory as Hardware Security Primitive 95
chapter is organized as follows: Sect. 4.2 describes the general framework of hardware
security primitives such as TRNG and strong PUF. Section 4.3 is a short overview of
the RRAM device, including the device structure and the switching mechanisms. Sections
4.4–4.8 present the possible approaches toward RRAM-based TRNG, based
on the stochastic phenomena in RRAM devices such as current noise (Sect. 4.4),
switching time variability (Sect. 4.5) and switching voltage variability (Sect. 4.6).
Section 4.7 presents TRNG schemes based on differential pairs of RRAM, while
Sect. 4.8 illustrates STT-MRAM-based TRNG concepts. Section 4.9 reviews recently
presented PUF concepts based on the RRAM technology. Finally, Sect. 4.10 provides
a summary and an outlook for the research and development of hardware security
using RRAM devices.
process [25]. It has been demonstrated that the high unpredictability of hardware-
based TRNG makes them more reliable with respect to software-based PRNG sys-
tems [26, 27]. In recent years, various physical entropy sources were proposed for
TRNG, such as random telegraph noise (RTN) in dielectrics [28, 29], stochastic
quantum processes [25], stochastic spintronic phenomena [30, 31], and memristive
transport and switching [32–34]. Several stochastic entropy sources were identified
in both CMOS technologies and emerging memristive devices.
CMOS-based TRNGs were demonstrated by exploiting the noise in scaled MOS-
FET [28], the metastability at turn-on of cross-coupled inverter pair (namely, SRAM
core) [8], or the increased noise of dual drain MOSFET driving a voltage-controlled
oscillator (VCO) [35]. All these concepts take advantage of the mature integration
capability of CMOS logic chips. However, CMOS-type TRNGs suffer from vari-
ous drawbacks: for instance, a colored noise spectrum, e.g., due to capture/emission
events originating 1/f noise, results in a biased output bitstream, requiring consid-
erable post-processing and a consequent circuit overhead. Noise in CMOS devices
also critically depends on environmental/process fluctuations, whose impact can be
minimized only with entropy-tracking feedback loops [8], thus resulting in additional
power consumption, circuit area and added complexity. On the other hand, memristive
concepts such as RRAM and STT-MRAM enable ultra-small entropy sources
with high-quality randomness, which makes these technologies very promising for
TRNG.
Fig. 4.1 Schematic representation of the current information security scenario with some applica-
tions of hardware security primitives, comprising data cryptography, hardware (HW) authentication
and stochastic/neuromorphic computing. The main building blocks are the random number gen-
erator (RNG) and the physical unclonable function (PUF). The RNG can be implemented either
as a pseudo-random number generator (PRNG), such as the linear-feedback shift register (LFSR)
in the figure, or as a true-random number generator (TRNG), which harnesses the stochasticity of
physical phenomena (like the noise in the current trace in the figure) to generate a random bitstream.
A PUF system introduces a challenge (c)–response (r) relation, where r = f(c) and f(·) is given by
the physical details defining that specific PUF instance. In the figure, a typical SRAM-based PUF
circuit is schematically shown. Adapted from [32]
A PUF system can be represented as a black box that for each input challenge c
returns an output response r = f (c), with f describing the unique internal physical
characteristics of the PUF (Fig. 4.1). The set of possible challenge-response pairs
(CRPs) defines a particular PUF instance.
Depending on the number of unique CRPs, there are two main categories of PUFs:
the weak PUF, which can only support a relatively small number of challenges, and
the strong PUF, with an extremely large set of CRPs [3]. More specifically, in a
weak PUF the number of CRPs increases linearly or polynomially with the number
of basic cells, i.e., the building blocks forming the PUF system [39], while the
number of CRPs increases exponentially in a strong PUF [12]. The weak PUF is often
referred to as physical obfuscated key (POK), since its primary task is the generation
or storage of a cryptographic key [40, 41]. The most popular implementation of
the PUF circuit is based on the digital static random-access memory (SRAM), and
exploits the metastable states of cross-coupled inverters [42]. In each inverter pair, the
response bit is determined by which of the two nominally equal-sized inverters of the
memory cell addressed by the challenge reaches the tri-state point faster. Memory-
based PUFs (POKs) are relatively easy to design even with low area overhead. Such
memory-based systems are essentially weak PUFs since the set of CRPs is limited
by the available memory capacity [43]. As a result, their CRP set can be completely
explored within polynomial time, compromising their use as identification tools. On
the other hand, given their large CRP sets, strong PUFs are practically immune to
brute-force attacks [12] and are therefore suitable for low-cost authentication.
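The difference in scaling between weak and strong PUFs can be made concrete with a short sketch. The snippet below is only illustrative: it assumes one CRP per addressable cell for a memory-based weak PUF, one valid challenge per n-bit word for a strong PUF, and a hypothetical readout rate of 10^6 CRPs per second. None of these numbers or function names come from the cited works.

```python
def weak_puf_crps(n_cells: int) -> int:
    """Memory-based weak PUF: one response bit per addressable cell (assumed)."""
    return n_cells


def strong_puf_crps(n_challenge_bits: int) -> int:
    """Strong PUF: every n-bit challenge is assumed to be valid."""
    return 2 ** n_challenge_bits


def readout_time_s(n_crps: int, crps_per_second: float = 1.0e6) -> float:
    """Time needed to read out the full CRP set at a hypothetical rate."""
    return n_crps / crps_per_second


if __name__ == "__main__":
    # A 1 Mb memory-based PUF versus a strong PUF with 128-bit challenges.
    print(readout_time_s(weak_puf_crps(2 ** 20)), "s for the weak PUF")
    print(readout_time_s(strong_puf_crps(128)), "s for the strong PUF")
```

Under these assumptions the weak PUF can be read out exhaustively in about a second, whereas the exhaustive readout of the strong PUF is out of reach for any practical attacker.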
Although there is no general metric to certify a PUF system in terms of security
properties, the following characteristics can be considered as the best figures of merit
(FOM) for PUF [12]:
98 R. Carboni and D. Ielmini
• Reliability: A PUF should always give the same response to a given challenge over
a wide range of operating conditions (voltage, temperature, etc.)
• Unpredictability: The PUF response to an arbitrary challenge should not be pre-
dicted based on the CRPs of another PUF or from the previous CRPs of the same
PUF.
• Unclonability: The CRP mapping of a PUF cannot be physically or mathematically
cloned, even for the original manufacturer of the PUF.
• Physical Unbreakability: Any physical attempt to maliciously modify the PUF
should result in a malfunction or a permanent damage of the chip.
The practical evaluation of these FOMs for a specific PUF system is not straightfor-
ward in general, as discussed in Sect. 4.9. Although extremely promising for low-cost
chip authentication, the PUF should be strong enough against attacks aiming at building
a model for the PUF. This kind of attack tries to develop a model of a PUF instance
by looking at a subset of its input–output pairs. Among these, the machine-learning
attacks have been demonstrated to be particularly successful [44, 45]. The careful
co-design of the stochastic memory and the circuit-dependent function is therefore
essential for developing strong PUFs for hardware security.
Among the emerging memory technologies, RRAM is one of the most promising due
to its nonvolatile retention, fast switching, low power, and CMOS compatibility [46–
49]. The RRAM integration in crosspoint arrays, in the back-end-of-line (BEOL),
and adopting 3D structures allows for increased density and ease of integration [50–
53]. Figure 4.2a shows a bipolar RRAM device, comprising a HfO2 switching layer
sandwiched between a TiN bottom electrode (BE) and Ti top electrode (TE). The Ti
layer at the TE acts as an oxygen exchange layer inducing the generation of oxygen
vacancies in the oxide layer, which thus becomes HfOx with x < 2 [34]. RRAM devices
are usually integrated into a one-transistor/one-resistor (1T1R) structure to enable the
control of the resistance level by limiting the current flowing in the select transistor
during the set transition. Figure 4.2b shows the current–voltage (I–V) characteristics
of the RRAM device, where the application of a positive voltage to the TE causes a
set transition from the high resistance state (HRS) to the low-resistance state (LRS)
in correspondence of the set voltage Vset . The application of a negative voltage to
the TE induces instead a reset transition from the LRS to the HRS in correspondence
of the reset voltage Vreset. The resistance window between the LRS and the HRS is
at least one order of magnitude, but can reach 5 orders of magnitude by the adoption
of high band gap dielectrics such as SiOx [54]. A relatively small gate voltage is
applied during the set transition to limit the compliance current IC across the device,
thus allowing control of the LRS resistance according to R = VC/IC, where VC is
a characteristic voltage generally lower than 1 V [55]. The LRS resistance can be
thus controlled by the parameter IC , while the HRS resistance can be controlled by
Fig. 4.2 Schematic of a typical 1T1R structure comprising a RRAM cell integrated on top of the
drain of an integrated MOSFET (a). In this example, the RRAM stack includes a Si-doped HfOx
switching layer, a Ti top electrode (TE) and a TiN bottom electrode (BE). The corresponding I–V
characteristics (b) show the definition of the set voltage Vset, the compliance current IC, the reset
voltage Vreset, and the stop voltage Vstop. Reprinted with permission from [34]. Copyright (2016)
IEEE
the maximum negative voltage along the reset sweep, namely the stop voltage Vstop
[54].
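As a small worked example of the relation R = VC/IC, the sketch below computes the LRS resistance for two compliance currents. The value VC = 0.4 V is an assumption chosen only for illustration (the text states only that VC is generally lower than 1 V), and the function name is hypothetical.

```python
def lrs_resistance_ohm(i_c_amp: float, v_c_volt: float = 0.4) -> float:
    """LRS resistance set by the compliance current, R = VC / IC.

    v_c_volt = 0.4 V is an assumed characteristic voltage, not a value
    taken from the text.
    """
    if i_c_amp <= 0:
        raise ValueError("compliance current must be positive")
    return v_c_volt / i_c_amp


if __name__ == "__main__":
    for i_c in (10e-6, 100e-6):
        print(f"IC = {i_c * 1e6:.0f} uA -> R = {lrs_resistance_ohm(i_c) / 1e3:.1f} kOhm")
```

With this assumed VC, a tenfold increase of IC lowers the LRS resistance by the same factor, which is the mechanism by which the select transistor controls the programmed state.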
RRAM switching is caused by ionic migration induced by the voltage and the local
Joule heating [56]. Because of the atomistic nature of the switching and local impact
of microstructure, such as crystalline grain boundary and orientation, the set and
reset transitions are characterized by a significant random variation [57]. The local
conduction path does not only change during set and reset operations, but is also prone
to stochastic atomistic fluctuations such as defect relaxation and diffusion which can
cause a significant variation in the read current after the programming pulse [58, 59].
RRAM variations thus affect both the device-to-device consistency within a memory
array [60], and the cycle-to-cycle variations within the same device because of the
several different defect configurations. Variations can affect all RRAM parameters,
including the LRS and HRS resistance, the set voltage Vset and the reset voltage
Vreset. While stochastic variations are critical in hindering memory and computing
application of RRAM [60], they offer the physical source of entropy that is needed
for hardware security primitives.
Stochastic phenomena in RRAM devices can be exploited as entropy source for
TRNG. RRAM schemes for TRNG can be grouped in three classes according to
Fig. 4.3, where the sources of entropy are (a) stochastic noise, (b) stochastic switching
time, or (c) stochastic switching voltage [10].
The fluctuation of a bistable defect within the RRAM conduction path in either
the LRS or HRS can lead to a relatively large fluctuation of the current between
two levels called random telegraph noise (RTN, see Fig. 4.3a) [61]. RTN induces a
random change in the read current between a low value I0 and a high value I1 . By
sampling the current trace in Fig. 4.3a, one obtains a bimodal distribution of currents
in Fig. 4.3b which can be used to assign the random bits “0” and “1”.
The current fluctuation in RTN can be ascribed to the modification of the charge
state of a bistable defect close to the conductive path, due to, e.g., electron trapping
and detrapping combined with a structural relaxation of the defect. The charge state
affects the carrier concentration close to the defect, thus resulting in a macroscopic
change of the measured current [58]. As the filamentary path diameter of the LRS
becomes smaller, the impact of the individual defect increases markedly, which is
generally evidenced by the difference between the two resistance values ΔR increasing
with the square of the average resistance (ΔR ∼ R²) [58, 61]. This is similar to
the RTN affecting the channel current in a MOS transistor, resulting from a bistable
fluctuation of the charge state of an oxide defect. As the RTN amplitude can be quite
significant, it can be exploited as an entropy source in RRAM devices.
Fig. 4.3 Random telegraph noise current fluctuations (a) and corresponding probability
density function (PDF) (b). In (c) the applied voltage pulse and its corresponding current response
evidencing the random delay time Δt, and PDF of Δt (d) with an equally spaced time window to
uniformly attribute bit values 0 and 1. Measured I–V curves evidencing cycle-to-cycle variation
of Vset (e), and PDF of the resistance measured after a stochastic set (f), where sub-distributions
of the high-resistance state and the low-resistance state are attributed to bits 0 and 1, respectively.
Reprinted with permission from Macmillan Publishers Ltd: Nature Electronics [10]. Copyright
(2018)
Fig. 4.4 a Measured I–V characteristics for negative voltage showing RTN. b Measured read
current as a function of time for read voltage Vread = 50, 200 and 350 mV and c corresponding
simulations. The RTN switching times ΔtON and ΔtOFF decrease with Vread. Reprinted with
permission from [58]. Copyright (2014) IEEE.
Fig. 4.5 a Schematic representation of the TRNG block diagram, including CRRAM, comparator
and clock control circuit. b Comparator output, showing binary random digital behavior. Reprinted
with permission from [29]. Copyright (2012) IEEE
To understand the impact of RTN on device behavior, Fig. 4.4a shows the mea-
sured current–voltage (I–V) characteristics for a RRAM device with HfOx switch-
ing layer. The current trace was measured at negative voltage in the LRS state and
clearly evidences discrete transitions, typical of RTN. Data show that RTN transition
rate increases at higher Vread, which can be better understood by constant-voltage
measurements of current as a function of time in Fig. 4.4b. Here, the average rate of
switching between the two RTN states increases with the read voltage for Vread = 50,
200 and 350 mV. Accordingly, the average time for which the current remains high
(ΔtON) and the time for which the current remains low (ΔtOFF) both decrease at
increasing Vread. The same behavior can be seen in the numerical simulations of
Fig. 4.4c obtained with a finite-element method (FEM) numerical model of RTN
[58]. The voltage dependence of RTN can be understood as the acceleration of RTN
fluctuation kinetics due to the voltage induced Joule heating. Similarly, RTN can be
accelerated at high ambient temperature [58]. Figure 4.5 shows the architecture of
a TRNG circuit exploiting RTN in RRAM [29]. The TRNG is based on a contact-
resistive random-access memory (CRRAM), integrated on the drain contact of a
Fig. 4.6 a Measured read current as a function of time for a device in LRS with R = 10 kΩ
and b corresponding relative standard deviation σI/I. c and d Calculated current versus time and
corresponding relative standard deviation. e PSD of experimental and calculated noise, showing a
1/f behavior. Results from the analytical model of [59] are reported in (b), (c) and (e). Reprinted
with permission from [59]. Copyright (2015) IEEE
MOS transistor with a 1T1R structure. The 1T1R structure is biased with a voltage
VR, thus any RRAM resistance fluctuation due to RTN results in a fluctuation of the
voltage VD at the transistor drain. The drain potential VD is compared to a reference
voltage Vref by an integrated comparator (Fig. 4.5a), leading to a binary random digital
output as shown in Fig. 4.5b. Sampling the digital output at increasing times with a clock
frequency fCK leads to a sequence of random bits provided fCK ≪ fRTN, where
fRTN is the average rate of RTN fluctuations.
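The role of the sampling clock relative to the RTN rate can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of the circuit in [29]: the RTN is a symmetric two-level process that toggles with probability p_switch per sample, the current levels and the Gaussian read noise are hypothetical values, and the comparator is an ideal threshold.

```python
import random


def simulate_rtn(n_samples, p_switch, i0=1.0e-6, i1=1.2e-6, sigma=2.0e-8, seed=0):
    """Two-level RTN trace; the state toggles with probability p_switch per sample.

    i0, i1 and sigma are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    state = 0
    trace = []
    for _ in range(n_samples):
        if rng.random() < p_switch:
            state ^= 1
        level = i1 if state else i0
        trace.append(level + rng.gauss(0.0, sigma))
    return trace


def threshold_bits(trace, i_ref):
    """Ideal comparator: bit 1 if the sample exceeds the reference, else 0."""
    return [1 if i > i_ref else 0 for i in trace]


def lag1_agreement(bits):
    """Fraction of consecutive bit pairs that are equal (0.5 if uncorrelated)."""
    same = sum(a == b for a, b in zip(bits, bits[1:]))
    return same / (len(bits) - 1)


if __name__ == "__main__":
    i_ref = 1.1e-6  # midpoint between the two assumed RTN levels
    fast_clock = threshold_bits(simulate_rtn(20000, p_switch=0.01), i_ref)
    slow_clock = threshold_bits(simulate_rtn(20000, p_switch=0.5), i_ref)
    print("clock faster than RTN:", lag1_agreement(fast_clock))
    print("clock slower than RTN:", lag1_agreement(slow_clock))
```

When the clock samples much faster than the RTN fluctuates (small p_switch), consecutive bits mostly repeat; only when many RTN transitions can occur between samples do consecutive bits become uncorrelated.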
Although the scheme is very simple, the TRNG of Fig. 4.5 has a few issues related
to both the physical concept and the circuit. The circuit has been reported to have
a relatively large area, namely 2400 F² in 65 nm technology, i.e., about 10 µm² [29].
Practical TRNGs based on RTN phenomena are also affected by the difficult control
of the amplitude, rate, and uniformity of the physical RTN. In fact, an unbiased RNG
with equal 50% probability of generating either a “0” or a “1” is obtained only if
the I0 and I1 sub-distributions in Fig. 4.3b have the same area. Also, as previously
described, RTN is affected by temperature and voltage, leading to instabilities of
the RTN entropy source. The amplitude of the RTN should be large enough to be
distinguishable by the comparator stage, while the reference voltage Vref needs to be
adjusted carefully depending on the specific level of the resistance and its fluctuation.
The nonuniformity of the “0”/“1” balance in the output bitstream can be compensated
by digital post-processing such as the von Neumann algorithm; however, this comes
at the expense of additional circuit area and power overhead.
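For reference, the von Neumann correction mentioned above can be sketched in a few lines. The extractor itself is the standard algorithm (non-overlapping bit pairs, "01" gives 0, "10" gives 1, equal pairs are discarded); the biased input generator and its 80% probability of "1" are hypothetical and serve only to show the throughput penalty.

```python
import random


def von_neumann(bits):
    """Von Neumann extractor on non-overlapping pairs.

    A pair (a, b) with a != b outputs a; pairs with a == b are discarded.
    The output is unbiased only if the input bits are independent.
    """
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out


def biased_bits(n, p_one, seed=0):
    """Independent bits with probability p_one of being 1 (illustrative source)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]


if __name__ == "__main__":
    raw = biased_bits(100000, p_one=0.8)
    corrected = von_neumann(raw)
    print("raw fraction of ones:      ", sum(raw) / len(raw))
    print("corrected fraction of ones:", sum(corrected) / len(corrected))
    print("output bits per input bit: ", len(corrected) / len(raw))
```

The correction removes the bias at the cost of discarding most of the input bits, which is one way to see the area and power overhead noted in the text.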
Most recently, to compensate for the area occupation and other issues of the TRNG
circuit of Fig. 4.5, the current difference in the 1/f^β noise of the RRAM device was
used as the entropy source [32]. The RRAM noise is associated with multi-trap capture
and emission events in defects (e.g., oxygen vacancies) along the conductive filament
(CF) in LRS and the localized conductive path in HRS [59]. Figure 4.6a shows the
read current Iread measured for a RRAM in the LRS with an average R = 10 kΩ,
biased with a read voltage Vread = 10 mV. Current fluctuations due to the 1/f noise
result in an increasing relative standard deviation σI/I, where I is the average value of
Fig. 4.7 a Conceptual representation of the entropy harvesting algorithm for TRNG. b Block
diagram of the parallel TRNG circuit, which allows for a 32 Mbps random bitstream. c Minimum
entropies are higher than 0.999 over a broad range of operating temperatures and voltages. d P-value
of 1000 sequences of 1 Mb bitstreams for the frequency test. Reprinted with permission from [32].
Copyright (2016) IEEE
Iread at any time t, while σI is its standard deviation (Fig. 4.6b). The increasing value
of σI/I with time is due to the increasing noise contributions at low frequency,
which is typical of the 1/f behavior of noise. The simulation results by a numerical Monte
Carlo model of 1/f noise in Fig. 4.6c and d show similar behavior [59]. Figure 4.6e
shows the measured and calculated power spectral density (PSD) (SI), evidencing a
clear −1 slope, typical of the 1/f noise. The 1/f noise can be harvested for TRNG by the
circuit shown in Fig. 4.7 [32]. Here, the noisy current is sampled at subsequent times
t and t + Δt, then the two sampled currents are subtracted and the difference
ΔI is compared to 0. Finally, the random bit value is assigned to 0 or 1 depending
on ΔI being positive or negative, respectively. With respect to the RTN scheme of
TRNG, the differential scheme allows both for a reduced area of 0.256 µm2 (or 160
F2 in 40 nm technology) and a reduced bias in the probability of extracting a “0” or a
“1” bit. In fact, the differential current ΔI (Fig. 4.7a) follows a Gaussian distribution,
thus ensuring that “0” and “1” have exactly the same probability of 50%. The circuit
design (Fig. 4.7b) allows for a precise current value extraction using a timing sense
amplifier (TSA) and a resistance-to-time converter (RTC) [62], while the parallel
configuration of multiple devices enables up to 32 Mbps operation, with a 0.04 nJ/bit
energy efficiency. Test results are finally reported by showing a minimum entropy
higher than 0.999 over a broad range of temperature (−40 °C < T < 120 °C) and with
different voltages (VDD = 0.1 V) (Fig. 4.7c). The high performance of the scheme is
further demonstrated by the P-value, i.e., a FOM for the randomness of the random bit
stream, of 1000 groups of 1 Mb bitstreams for the frequency test of the NIST 800-22
suite [63] (Fig. 4.7d).
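The differential sampling idea and the frequency test can be sketched as follows. This is an illustration under explicit assumptions, not the circuit of [32]: the slowly drifting current is emulated with a Gaussian random walk as a stand-in for low-frequency noise (it is not a 1/f noise model), the current values are hypothetical, and the P-value follows the frequency (monobit) test definition of the NIST 800-22 suite.

```python
import math
import random


def noisy_current(n, i_mean=1.0e-6, step=1.0e-9, seed=1):
    """Random-walk current trace: a stand-in for low-frequency noise (assumption)."""
    rng = random.Random(seed)
    current = i_mean
    trace = []
    for _ in range(n):
        current += rng.gauss(0.0, step)
        trace.append(current)
    return trace


def differential_bits(trace):
    """Compare samples at t and t + dt; bit 0 if the difference is positive, else 1."""
    bits = []
    for i_t, i_next in zip(trace[0::2], trace[1::2]):
        delta_i = i_next - i_t
        bits.append(0 if delta_i > 0 else 1)
    return bits


def monobit_p_value(bits):
    """Frequency (monobit) test: P = erfc(|S_n| / sqrt(2 n)), with S_n the +/-1 sum."""
    n = len(bits)
    s_n = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s_n) / math.sqrt(2.0 * n))


if __name__ == "__main__":
    bits = differential_bits(noisy_current(40000))
    print("fraction of ones:", sum(bits) / len(bits))
    print("monobit P-value: ", monobit_p_value(bits))
```

Because only the sign of the difference between two nearby samples is used, the slow drift of the average current does not bias the output, whereas a fixed threshold on the same trace would.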
Fig. 4.8 Distributions of switching time delay for applied voltage of 2.6 V (a), 3.2 V (b), and 3.6 V
(c), with their corresponding fitting with the Poisson distribution. The only fitting parameter was
τ = 15.3, 1.2 and 0.029 ms for figure (a), (b) and (c), respectively. d Shows the voltage dependence
of τ . Reprinted with permission from [69]. Copyright (2008) American Chemical Society
in Fig. 4.8d decreases exponentially with the applied voltage, thus reflecting the
decrease of the effective energy barrier EA with the applied voltage [56, 64]. Data
in Fig. 4.8d highlight that, although the single switching event is stochastic, the
overall distribution of switching times can be predicted and controlled by the applied
voltage [68, 69].
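The exponential voltage dependence can be checked against the three τ values quoted in the caption of Fig. 4.8 (15.3, 1.2 and 0.029 ms at 2.6, 3.2 and 3.6 V). The sketch below fits ln τ versus V with a least-squares line; the functional form τ = τ0·exp(−V/V0) and the names used are assumptions made for illustration, not the fitting procedure of [69].

```python
import math


def fit_log_linear(voltages, taus):
    """Least-squares fit of ln(tau) = intercept + slope * V."""
    ys = [math.log(t) for t in taus]
    n = len(voltages)
    v_mean = sum(voltages) / n
    y_mean = sum(ys) / n
    s_xy = sum((v - v_mean) * (y - y_mean) for v, y in zip(voltages, ys))
    s_xx = sum((v - v_mean) ** 2 for v in voltages)
    slope = s_xy / s_xx
    intercept = y_mean - slope * v_mean
    return slope, intercept


if __name__ == "__main__":
    voltages = [2.6, 3.2, 3.6]       # applied voltage (V), from the caption of Fig. 4.8
    taus_ms = [15.3, 1.2, 0.029]     # characteristic time (ms), from the same caption
    slope, intercept = fit_log_linear(voltages, taus_ms)
    print("slope of ln(tau) versus V (1/V):", slope)
    print("assumed characteristic voltage V0 = -1/slope (V):", -1.0 / slope)
```

With these three points the fitted slope is roughly −6 per volt, i.e., τ drops by more than two decades for each additional volt, which is why a small change of the applied voltage can shift the whole switching-time distribution.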
The stochastic delay time was adopted as the entropy source for TRNG by the
circuit shown in Fig. 4.9a [66]. The proposed TRNG consists of a volatile RRAM
device with Ag TE and Ag-doped SiO2 dielectric layer. In this type of device, the
Ag migration from the TE results in the formation of an unstable CF, which decays
soon after the set transition with a retention time ranging from a few µs to a few ms
[70–73]. The volatile behavior is due to the large diffusivity of Ag combined with
Fig. 4.9 a Schematic representation of the TRNG circuit block diagram, comprising a memristive
device, a comparator, an AND gate, and a counter. b Pulsed waveforms at each stage of the circuit,
explaining the working principle of the TRNG. Reprinted from [66]. Creative Commons (2017)
the mechanical compressive stress in the dielectric layer [74] and the tendency to
minimize the surface to volume ratio of the CF [71]. In the TRNG circuit of Fig. 4.9a,
the volatile RRAM device is connected with a series resistance in a voltage-divider
configuration. The potential V2 of the intermediate node of the voltage-divider serves
as the input of a voltage comparator. The comparator output and a clock pulse serve
as the input of an AND gate, and a counter reads the AND output. A TRNG cycle is
shown in Fig. 4.9b: the application of a voltage pulse V1 (1) causes a set transition
in the RRAM device after a stochastic delay Δt, which causes V2 to suddenly increase
above the reference Vref (2), thus making the comparator output go to a high logic
level V3 (3). Due to the stochastic Δt, the V3 pulse has a random duration, which is
measured by the counter in units of the clock period TCLK. Note that the binary bit
(6) flips between 0 and 1 for the whole duration of the V3 pulse, eventually resulting
in a random bit. A nonvolatile RRAM could be adopted in this scheme as
well; however, a reset pulse would be needed to reinitialize the device for a new cycle.
The use of volatile RRAM in this case makes the TRNG algorithm easier and more
energy efficient, as no reset pulse is needed.
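The counting principle of this TRNG can be sketched behaviorally. The snippet assumes an exponentially distributed delay Δt with mean τ (a common choice for a single stochastic switching event, but an assumption here), treats the comparator and AND gate as ideal, and takes the output bit as the parity of the number of clock periods counted while V3 is high. All parameter values and names are hypothetical.

```python
import random


def delay_time_bit(rng, tau, t_pulse, t_clk):
    """One TRNG cycle: returns 0/1, or None if no switching occurs within the pulse."""
    delta_t = rng.expovariate(1.0 / tau)       # stochastic set delay (assumed exponential)
    if delta_t >= t_pulse:
        return None                            # device did not switch during the pulse
    n_clk = int((t_pulse - delta_t) / t_clk)   # clock periods counted while V3 is high
    return n_clk % 2                           # the bit flips once per counted period


def delay_time_bits(n, tau=1.0e-4, t_pulse=1.0e-3, t_clk=1.0e-6, seed=0):
    """Collect n valid bits, skipping cycles without a switching event."""
    rng = random.Random(seed)
    bits = []
    while len(bits) < n:
        bit = delay_time_bit(rng, tau, t_pulse, t_clk)
        if bit is not None:
            bits.append(bit)
    return bits


if __name__ == "__main__":
    bits = delay_time_bits(20000)
    print("fraction of ones:", sum(bits) / len(bits))
```

In this toy model the output is well balanced only when the spread of Δt covers many clock periods while staying within the pulse width, which is the kind of matching condition that forces the pulse voltage to be tuned.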
To match the time window TCLK < Δt < tP, the pulse voltage V1 should be
carefully tuned, which usually requires complicated probability tracking techniques
[75]. Also, extracting entropy from the stochastic switching time can be difficult due
to its sensitivity to device parameters and process variations, requiring probability
tracking of the applied voltage for every TRNG on the same chip, or in separate chips
[76].
A promising and more robust TRNG relies on the exploitation of the stochastic
switching voltage. Namely, instead of measuring the delay time Δt for switching,
one can monitor the device for a given amount of time, where the switching probabil-
ity becomes the stochastic entropy source. This approach is schematically depicted
in Fig. 4.3e, where various current–voltage characteristics measured on the same
RRAM device demonstrate a distribution of set voltage (Vset ), due to the cycle-to-
cycle variation. The application of a voltage equal to the average transition voltage
<Vset > to the device in the HRS will then lead to a set transition with 50% prob-
ability. As a result, the measured resistance of the device after the applied voltage
pulse then shows a bimodal distribution as indicated in Fig. 4.3f, where the two
sub-distributions correspond to LRS and HRS. The random bitstream can thus be
generated by associating the LRS and HRS to bit values “0” and “1”, respectively
[33]. A similar scheme can be extended to stochastic computing, where an ana-
log value can be obtained as the sequence of stochastic bimodal resistance values
obtained from the same device [68]. To illustrate the voltage-based TRNG con-
cept, Fig. 4.10a shows the measured I–V curves for the same RRAM device with
1T1R configuration for six successive set/reset cycles [33]. The switching parame-
ters, such as set and reset voltages, and the HRS and LRS resistance values show
a large variability from cycle to cycle, which can be explained by considering the
physics of the random formation and disruption of the conductive filament [57].
Figure 4.10b shows the pulse sequence for characterizing the random set transition
process, including: (1) a positive set pulse to deterministically initialize the device
in LRS, (2) a negative reset pulse with a stop voltage Vstop to induce transition to
the HRS, (3) a positive set pulse with voltage (V A ) close to <Vset > to stochastically
induce a set transition event, and (4) a read pulse to measure the resistance in the
final state. Figure 4.10c shows the resulting resistance distribution for a random set
experiment with VA = 1.6 V. Data show a bimodal distribution, corresponding to an
LRS sub-distribution with R ≈ 12 kΩ and an HRS sub-distribution above 100 kΩ.
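The sensitivity of the set probability to the applied voltage can be illustrated with a simple Monte Carlo estimate. The sketch assumes a Gaussian cycle-to-cycle distribution of Vset with a mean of 1.6 V (matching the applied voltage quoted above) and a hypothetical standard deviation of 0.1 V; the distribution shape and width are assumptions, not measured data.

```python
import random


def set_probability(v_a, v_set_mean=1.6, v_set_sigma=0.1, n=20000, seed=0):
    """Fraction of cycles in which the pulse amplitude v_a exceeds the sampled Vset.

    The Gaussian Vset distribution and its parameters are assumed for illustration.
    """
    rng = random.Random(seed)
    hits = sum(rng.gauss(v_set_mean, v_set_sigma) < v_a for _ in range(n))
    return hits / n


if __name__ == "__main__":
    for v_a in (1.55, 1.60, 1.65):
        print(f"VA = {v_a:.2f} V -> set probability = {set_probability(v_a):.3f}")
```

Under these assumptions a deviation of VA from the average Vset by only half a standard deviation already moves the set probability far from 50%, which illustrates why the applied voltage must be matched closely to the average set voltage.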
The origin of the bimodal distribution is clarified in Fig. 4.10d, which shows three
characteristic I–V curves for various stochastic events, corresponding to state A,
B, and C in Fig. 4.10c. Case A corresponds to a cycle where Vset was higher than
the applied VA, due to a relatively high HRS resistance after the reset pulse. As a result,
no set process took place in this case, thus the resistance was found in the HRS
sub-distribution (Fig. 4.10c). Case C corresponds to Vset being smaller than VA, thus
Fig. 4.10 a Measured I–V characteristics for six cycles on the same 1T1R structure, evidencing
stochastic switching. b Sequence of applied pulses for TRNG, with (c) the cumulative distribution
of read resistance. Random set process is highlighted in the three I–V curves (d), corresponding to
states A, B and C in (c). Reprinted with permission from [33]. Copyright (2015) IEEE
Fig. 4.11 Measured resistance for 500 random set cycles with Vstop = −1.45 V and VA = 1.6 V
(a), correlation of R in cycle i+1 as a function of R in cycle i (b) and (c) population of the four
regions in (b). Reprinted with permission from [33]. Copyright (2015) IEEE
of correlation across two consecutive cycles, which is consistent with true random-
ness of the bit stream. To guarantee proper RNG operation, a positive-feedback
regeneration of the analog output values might be required. Figure 4.12a shows a
compact regeneration circuit [33], comprising a RRAM device in 1T1R structure as
the first stage and a CMOS inverter as the second stage. This scheme takes advan-
tage of the relatively large resistance window between LRS and HRS, thus allowing
the use of a small CMOS inverter instead of the larger analog comparator, which is
instead typically required for recovering the small signal in RTN-based RNG [29].
Figure 4.12b shows the Vin –Vout characteristics of the CMOS inverter, evidencing
the high gain in the transition region (with a threshold voltage VT = 0.4 V) which
allows for digital restoration. The impact of this regeneration circuit on the random
bit distribution is illustrated in Fig. 4.13, showing measured and simulated bimodal
resistance distributions (a), the simulated digital bimodal distribution of the inverter
output Vout (b) and the sequence of the output voltage Vout for 2 × 10⁵ cycles (c).
To achieve a sufficient uniformity of the generated random bits, the applied voltage
should be finely tuned to match the exact value <Vset >. This requires a preliminary
To overcome the need for probability tracking in voltage-based TRNG, various differential
schemes have been recently developed [34, 75]. In these TRNGs, either the
competition between two RRAM devices [34] or the comparison between consecutive
cycles on the same device [75] yields high-quality entropy without probability
tracking, thus with a relatively simple circuit layout. A typical differential scheme
relies on the coupling of two RRAM devices in either series or parallel configurations
with the entropy source being the variability of set or reset transitions [34]. Three
different schemes were proposed, namely: (a) parallel reset, (b) series reset and (c)
parallel set, as detailed in the following [34]. Figure 4.14a shows the parallel-reset
TRNG circuit, comprising two RRAM cells, referred to as P and Q, connected in
parallel. The common BE is connected to a comparator for the differential read.
Figure 4.14b shows the waveforms applied to the TE of devices P and Q, i.e., VP
and VQ, respectively, and the voltage Vout of the common BE node between P and
Q. During a TRNG cycle, the applied waveforms include three phases, namely, (1)
a positive voltage is applied across both P and Q in parallel, inducing a set transition
in both devices, (2) a negative voltage is applied across P and Q in parallel, inducing
a reset transition in both devices, (3) a differential read phase where +Vread and
−Vread are applied at P and Q, respectively, with floating BE to test the voltage divider
between P and Q. Depending on the resistance values of P and Q, namely RP and RQ,
respectively, the output voltage is found to be positive or negative, thus dictating the value
of the output random bit. Given the relatively large variability of the HRS resistance
[33, 57], Vout varies stochastically from cycle to cycle, thus constituting the basis
for random bit generation. In this first approach, HRS resistance variation acts as
the entropy source. Note that the bit value probability is automatically set to 50%
by the uniform cycle-to-cycle distributions of HRS resistance of P and Q, as the
cycle-to-cycle variation in RRAM is comparable to the cell-to-cell variation [77].
Fig. 4.14 a Parallel-reset differential scheme for TRNG and b sequence of applied signals. Both
P and Q start in HRS and are independently set, then reset and finally read using a voltage-divider
configuration. The analog comparator (CMP) digitally restores the output signal. Reprinted with
permission from [34]. Copyright (2016) IEEE
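The differential read of the parallel-reset scheme can be sketched with an ideal voltage divider. With +Vread on P, −Vread on Q and a floating common node, the divider gives Vout = Vread (RQ − RP)/(RP + RQ). The lognormal HRS distribution, its parameters, the read voltage, and the mapping of the sign of Vout to a bit value are assumptions made for this illustration only.

```python
import math
import random


def differential_read(r_p, r_q, v_read=0.1):
    """Common-node voltage with +v_read on P and -v_read on Q (ideal divider)."""
    return v_read * (r_q - r_p) / (r_p + r_q)


def parallel_reset_bits(n, r_median=1.0e5, sigma_ln=0.7, seed=0):
    """Random bits from two HRS resistances drawn from the same distribution.

    The lognormal shape and its parameters are hypothetical.
    """
    rng = random.Random(seed)
    mu = math.log(r_median)
    bits = []
    for _ in range(n):
        r_p = rng.lognormvariate(mu, sigma_ln)  # HRS of P after the reset phase
        r_q = rng.lognormvariate(mu, sigma_ln)  # HRS of Q after the reset phase
        bits.append(1 if differential_read(r_p, r_q) > 0 else 0)
    return bits


if __name__ == "__main__":
    for median in (1.0e5, 1.0e6):
        bits = parallel_reset_bits(20000, r_median=median)
        print(f"median HRS = {median:.0e} ohm -> fraction of ones = {sum(bits) / len(bits):.3f}")
```

Since the two resistances are drawn from the same distribution, the sign of Vout is balanced regardless of the median HRS value, which is the sense in which the differential scheme removes the need for probability tracking.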
Figure 4.15a shows the cumulative distributions of measured and calculated R P
and R Q , both after set and after reset. The read Vout distributions are shown in
Fig. 4.15b for experimental and calculated data, indicating a bimodal shape with
50% transition probability. By reading the voltage Vout with an analog comparator
(Fig. 4.14a), the bimodal distribution can be improved, as shown by the distribution
Fig. 4.15 Cumulative distributions of resistance after set and after reset for cell P and Q (a).
b Distributions of the output voltage Vout and Vout2 , before and after the CMP, respectively. c
Measured Vout and Vout2 for 1000 RNG cycles with the corresponding PDFs (d). Reprinted with
permission from [34]. Copyright (2016) IEEE
Fig. 4.16 a Series reset differential scheme for TRNG and b sequence of applied signals. From the
HRS, the cells are independently set, then they undergo a random reset, during which only one can
switch, and finally they are read in voltage-divider configuration. Reprinted with permission from
[34]. Copyright (2016) IEEE
of the comparator output Vout2 in Fig. 4.15b. The bulky comparator may be replaced
by a CMOS inverter, thus reducing the on-chip area occupation [33]. To demonstrate
the cycle-by-cycle operation of the parallel-reset scheme, Fig. 4.15c shows Vout
and Vout2 for 1000 consecutive cycles, while Fig. 4.15d shows their corresponding
probability density functions. The TRNG does not require any probability tracking
thanks to the cycle-to-cycle variability being comparable to the cell-to-cell variability [77].
Figure 4.16a shows an alternative differential TRNG scheme, namely the series-reset
configuration. This comprises two RRAM devices connected in series with V P and
V Q as supply voltages and the intermediate node of potential Vout connected to an
output comparator. Figure 4.16b shows the applied waveform of V P , V Q and Vout
during a TRNG cycle, consisting of (1) independent set of P and Q, (2) random reset
of either P or Q, (3) differential read of Vout. For simplicity, we assumed VQ = −VP
in the figure. During the random reset event, a negative voltage VP − VQ < 0 is
applied to the two devices in series, while the common node is left floating. A total
applied voltage |VP − VQ| > 2Vreset drops across the devices, thus inducing a reset
transition in one of the two devices. In fact, once the transition begins in one of the
two cells, the voltage across it increases because of the voltage-divider effect, while
the voltage drop across the other device decreases, thus preventing both devices
from undergoing the reset transition. This configuration thus realizes a positive feedback,
resulting in a self-accelerated reset event that takes place randomly in one device
only. Specifically, the reset transition takes place in the device with the smallest
Vreset. Because of the cycle-to-cycle variability of Vreset, the probability for one
device to reset is ideally 50% [57]. Figure 4.17a shows the cumulative distribution
of R P and R Q after set and reset pulses in Fig. 4.16b [34]. After the random reset
pulse, both P and Q show the same bimodal distribution with transition point at
50% probability, thus demonstrating unbiased TRNG with no need for probability
tracking. To gain further insight on the random reset process, Fig. 4.17b shows the
correlation plot of R Q as a function of (R P ) after either set or reset. R P and R Q
appear to be anti-correlated after the reset phase, namely R P is high for low R Q
and vice versa, which thus reveals a conditional reset of one RRAM device only.
Figure 4.17c shows the distributions of experimental and calculated Vout, indicating
a bimodal shape with transition point at 50% probability. Similar to other TRNG
schemes, a digital regeneration can be obtained by a comparator or a CMOS inverter.
Figure 4.17d shows the cycle-to-cycle values of Vout and Vout2 during the application
of the RNG pulse scheme of Fig. 4.16b. Note that after each differential read phase,
a final deterministic reset pulse was applied to ensure equal HRS conditions in P
and Q before the application of the set pulse. Figure 4.17e shows the corresponding
distributions of Vout and Vout2 for both data and calculations [34]. Figure 4.18a shows
the parallel set scheme [34], where the two RRAM devices in parallel configuration
are connected to a common select transistor, with the drain terminal connected to
the input node of a comparator. Figure 4.18b shows the applied waveform cycle,
including (1) an independent reset of P and Q, (2) a random set pulse of P and Q, and
(3) a differential read by the application of a voltage 2Vread across the two devices,
while the transistor is biased in the off state. This TRNG scheme is based on the
one-transistor/two-resistor (1T2R) structure in Fig. 4.18a, where the application of a
positive voltage across the devices causes the set transition to take place randomly in one
of the two devices first. As a result of the transition to the LRS and the voltage-divider
effect with the transistor, the voltage across both devices drops, which prevents any
set transition from taking place in the second RRAM device. In this TRNG scheme, the
cycle-to-cycle variability of Vset plays the role of entropy source. Figure 4.19a shows
the read resistance distributions for P and Q, evidencing the expected bimodal shape
with HRS/LRS transition at 50%. In order to verify that the random set happens
stochastically in either one of the devices, Fig. 4.19b shows the correlation plot of
RQ as a function of RP, again indicating an anti-correlation where P is in the HRS for Q
in the LRS, and vice versa. Finally, Fig. 4.19c shows the cycle-to-cycle output values of
Vout and Vout2, while Fig. 4.19d shows their corresponding probability distributions.
Comparing these solutions for entropy harvesting, different performances are
apparent in terms of bimodal distribution of R and Vout . For instance, the parallel-set
TRNG (Fig. 4.19) shows improved results with respect to the parallel-reset TRNG
(Fig. 4.15). This can be understood considering the abrupt set transition in the parallel
set process as opposed to the more gradual reset event in the parallel-reset process.
The abrupt set transition is explained by the physical positive feedback where the
first initiation of the filament causes an increase of the local Joule heating, thus
accelerating the further growth of the filament [57]. This highlights the key role that
the physics of the entropy-generating process plays in controlling the quality of the
TRNG circuit.
A general drawback of the differential pair approach is the assumption that cycle-to-cycle
variation dominates over cell-to-cell variation. In the presence of a large
mismatch between the two cells in the differential pair, e.g., where one cell systematically
displays a lower Vset than the other cell, the TRNG might show deviations
from the uniform behavior. Although this might be acceptable for PUF applications,
where the random unique key has to be generated only once in the lifetime of the
device, it might cause unacceptable nonuniformities in a TRNG [34].
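The effect of cell-to-cell mismatch on a differential pair can be illustrated with a small Monte Carlo sketch. It is not the model used in [34]: it simply assumes that, in each cycle, the Vset of each cell is drawn from a normal distribution and that the cell with the lower Vset sets first and inhibits the other. The mean Vset of 1.0 V, the cycle-to-cycle spread of 0.1 V and the mismatch values are hypothetical numbers chosen only for illustration.

```python
import random


def parallel_set_cycle(rng, vset_mean_p, vset_mean_q, sigma_c2c):
    """One TRNG cycle: the cell with the lower Vset in this cycle sets first."""
    # Cycle-to-cycle variability: a fresh Vset is drawn for each cell (assumed normal).
    vset_p = rng.gauss(vset_mean_p, sigma_c2c)
    vset_q = rng.gauss(vset_mean_q, sigma_c2c)
    p_sets = vset_p < vset_q
    # Exactly one cell ends in the LRS, so the two states are anti-correlated.
    state_p = "LRS" if p_sets else "HRS"
    state_q = "HRS" if p_sets else "LRS"
    return state_p, state_q


def ones_fraction(mismatch, sigma_c2c=0.1, cycles=20000, seed=0):
    """Fraction of cycles in which cell P sets, for a given mean-Vset mismatch (V)."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(cycles):
        state_p, _ = parallel_set_cycle(rng, 1.0, 1.0 + mismatch, sigma_c2c)
        ones += state_p == "LRS"
    return ones / cycles


if __name__ == "__main__":
    # Hypothetical mismatch values between the mean Vset of Q and of P.
    for mismatch in (0.0, 0.05, 0.2):
        print(f"mismatch = {mismatch:.2f} V -> P(bit = 1) = {ones_fraction(mismatch):.3f}")
```

In this toy model the output is balanced only when the mismatch is small compared with the cycle-to-cycle spread, which is the condition stated above for the differential pair approach.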
Fig. 4.17 a Cumulative distributions of R after set and after reset for both cells P and Q. b Correlation
plot of RQ as a function of RP. c Cumulative distributions of Vout and Vout2. d Measured
Vout and Vout2 during RNG cycling and e corresponding PDF. Reprinted with permission from
[34]. Copyright (2016) IEEE
Fig. 4.18 a Parallel set differential scheme and b sequence of applied signals. From the LRS, the
cells are first independently reset, then subjected to parallel set, and finally read with voltage-divider
configuration. Reprinted with permission from [34]. Copyright (2016) IEEE
Fig. 4.19 a Cumulative distributions of R after set and after reset for P and Q. b Correlation plot
of RQ as a function of RP after set and reset. c Cumulative distributions of Vout and Vout2 during
RNG cycling, and d corresponding measured Vout and Vout2 PDF. Reprinted with permission from
[34]. Copyright (2016) IEEE
The presented TRNG schemes can be adopted for all stochastic memory devices, e.g.,
phase change memory (PCM) or STT-MRAM. In particular, STT-MRAM
offers improved cycling endurance [78] and fast switching [79], which might benefit
the TRNG operation by providing an extended lifetime and a higher throughput. Figure 4.20a
shows a typical state-of-the-art STT-MRAM device, consisting of a magnetic tunnel
junction (MTJ) with perpendicular magnetic anisotropy (PMA) [78]. The MTJ consists
of a pinned layer (PL) and a free layer (FL), acting as bottom electrode (BE) and
top electrode (TE), respectively, both made of ferromagnetic CoFeB. Between
the two electrodes, a dielectric layer made of crystalline MgO serves as the tunneling
barrier to induce the MTJ effect [80]. As schematically shown in Fig. 4.20b, this
memory device has two stable states, where the magnetic polarization of the FL can
be either parallel (P) or antiparallel (AP) to the magnetization of the PL, resulting
in low or high resistance of the MTJ, respectively [78, 80]. Figure 4.20c shows the
Fig. 4.20 a Typical STT-MRAM device, consisting of a magnetic tunnel junction (MTJ) stack. b
Energy as a function of the FL magnetic polarization direction with respect to the PL, showing P
and AP states. c Measured and calculated I–V and d R–V pulsed characteristics with 1 µs pulse
width. Reprinted with permission from [75]. Copyright (2018) IEEE
Fig. 4.21 a Measured rectangular voltage pulses and current response for 2 consecutive cycles
n−1 and n, b PDF of the integrated current Qn and c PDF of differential charge ΔQn = Qn − Qn−1 .
The pulse sequence includes positive and negative rectangular pulses for stochastic set and reset
transitions, respectively, as evidenced by the abrupt steps in the current response. The random bit
is assigned from the value of ΔQn in (c). Reprinted with permission from [75]. Copyright (2018)
IEEE
issues, a novel differential concept was presented, where consecutive switching
cycles are compared in the same device, instead of two coupled devices [75].
Figure 4.21a shows the applied voltage and the device current response over two consecutive
set/reset cycles. In each cycle, a stochastic pulse with positive V+ is applied,
followed by a deterministic pulse with negative V−. Both pulses have a duration
of 1 µs, although the concept can easily be scaled to a shorter pulse width thanks to
the high speed of the switching process in STT-MRAM. The stochastic switching
is evidenced in Fig. 4.21a, where a shorter delay time tset is observed during
cycle n−1 than during cycle n. The TRNG relies on the comparison of the
current responses in two consecutive cycles of the same STT-MRAM device.
Figure 4.21b shows the probability distribution of the integrated current Qn = ∫ i dt,
while Fig. 4.21c shows the distribution of the corresponding difference ΔQn = Qn − Qn−1. Given the
highly symmetric distribution of ΔQn, the latter is chosen as the statistical variable
for random bit generation, where a random bit value 0 or 1 is assigned for ΔQn < 0
or ΔQn > 0, respectively [75].
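The bit assignment from the differential charge can be sketched in a few lines. This is only an illustration of the rule above, not the circuit of [75]. The text does not state whether successive differences share a cycle, so the sketch assumes non-overlapping pairs of cycles. The helper `simulated_charge` and all its parameters (exponentially distributed switching delay, current levels before and after switching) are hypothetical stand-ins for measured data.

```python
import random


def bits_from_charges(charges):
    """Assign one bit per non-overlapping pair of cycles from the sign of Q[n] - Q[n-1]."""
    bits = []
    for k in range(1, len(charges), 2):
        delta_q = charges[k] - charges[k - 1]
        if delta_q > 0:
            bits.append(1)
        elif delta_q < 0:
            bits.append(0)
        # delta_q == 0 is skipped: the rule in the text does not cover it (assumption).
    return bits


def simulated_charge(rng, pulse_width=1e-6, mean_delay=0.5e-6,
                     i_before=20e-6, i_after=40e-6):
    """Hypothetical integrated current for one rectangular pulse with a random switching delay."""
    t_switch = min(rng.expovariate(1.0 / mean_delay), pulse_width)
    return i_before * t_switch + i_after * (pulse_width - t_switch)


if __name__ == "__main__":
    rng = random.Random(1)
    charges = [simulated_charge(rng) for _ in range(20000)]
    bits = bits_from_charges(charges)
    print(f"{len(bits)} bits, fraction of ones = {sum(bits) / len(bits):.3f}")
```

Because both charges in a pair come from the same device under the same pulse, their difference is symmetric around zero by construction, which is why no probability tracking is needed in this toy model.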
Figure 4.22a shows the same concept for TRNG applied to the case of a triangular
waveform. Both positive and negative triangular pulses are applied for stochastic
set and deterministic reset, respectively. In this case, the stochastic switching is
evidenced by the different set and reset voltages in cycles n−1 and n, resulting in
different current waveforms during the two consecutive cycles. Figure 4.22b shows the
distribution of the integrated current over a single cycle, Qn = ∫ i dt, while Fig. 4.22c
shows the difference ΔQn = Qn − Qn−1 over two consecutive cycles, serving as the
stochastic variable for bit generation. In the TRNG concepts illustrated in Figs. 4.21
Fig. 4.22 a Measured triangular voltage pulses and current response for 2 consecutive cycles n−1
and n, b PDF of the integrated current Qn and c PDF of differential charge ΔQn = Qn − Qn−1 .
The pulse sequence includes positive and negative triangular pulses for stochastic set and reset
transitions, respectively, as evidenced by the abrupt steps in the current response. The random bit
is assigned from the value of ΔQn in (c). Reprinted with permission from [75]. Copyright (2018)
IEEE
and 4.22, the entropy source is either the stochastic distribution of switching time,
or the stochastic distribution of switching voltage, respectively [75].
Generally, TRNG concepts require a further whitening algorithm, such as the Von
Neumann correction [76] or the XOR operation [83], to achieve a truly unbiased
bitstream. However, the schemes of Figs. 4.21 and 4.22 can pass the standard tests of
the National Institute of Standards and Technology (NIST) [63] without any post-processing,
thus enabling a reduced energy and area overhead of the TRNG circuit
[75]. Figure 4.23 reports the pass rate for the nonoverlapping template test in the
NIST criteria as a function of pulse voltage for rectangular and triangular pulses. The
TRNG with rectangular pulses shows an acceptable pass rate only within
a narrow voltage window, with randomness degrading at both higher and lower
voltages. On the other hand, the TRNG with triangular pulses shows a high pass rate
over the whole test range, demonstrating high, voltage-independent randomness.
These results can be explained by considering the dependence on the applied voltage (VA)
of the switching parameters tset and Vset (or treset and Vreset) for rectangular and
triangular pulses [75]. Considering a rectangular pulse, the set time tset can be written
as [85]:
$$t_{set} = \tau_0 \exp\left[\Delta\left(1 - \frac{V}{V_0}\right)\right], \tag{4.1}$$
where V0 and τ0 are constants, V is the applied voltage, and Δ is the thermal stability
factor. Given the exponential dependence in (4.1), there is only a narrow window
of voltages where the switching time tset is comparable to the applied pulse width
(Fig. 4.21a). On the other hand, the set voltage under a triangular pulse, where the
applied voltage is ramped according to V(t) = 2VA t/tP, can be estimated from
Fig. 4.23 Pass rate of the nonoverlapping template NIST test as a function of pulse voltage for
rectangular and triangular pulses. The pass rate is referred to a total of 148 tests. Rectangular pulses
show an operation window around 0.6 V, whereas triangular pulses show voltage-independent high
randomness. Reprinted with permission from [75]. Copyright (2018) IEEE
the integrated switching probability reaching one, namely ∫ (1/tset) dt = 1, with tset
defined by (4.1). Thus, the set voltage along a triangular pulse is given by [64, 82]:
$$V_{set} \approx V_0 \ln\left(\frac{t_0 V_A}{V_0 t_P}\right), \tag{4.2}$$
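The narrow operation window of the rectangular pulse can be made concrete by evaluating (4.1) numerically. The sketch below is only illustrative: the values τ0 = 1 ns, Δ = 50, V0 = 0.7 V and the 1 µs pulse width are hypothetical choices, not parameters reported in [75] or [85]. It also assumes that the probability of switching within a pulse of width tP follows single-exponential statistics, P = 1 − exp(−tP/tset), which is a common thermal-activation assumption and is not stated in the text.

```python
import math

# Hypothetical parameters, chosen only to illustrate the shape of Eq. (4.1).
TAU0 = 1e-9         # attempt time tau_0 (s)
DELTA = 50.0        # thermal stability factor
V0 = 0.7            # characteristic voltage V_0 (V)
PULSE_WIDTH = 1e-6  # rectangular pulse width t_P (s)


def t_set(voltage):
    """Set time from Eq. (4.1): t_set = tau_0 * exp(Delta * (1 - V / V_0))."""
    return TAU0 * math.exp(DELTA * (1.0 - voltage / V0))


def switching_probability(voltage, pulse_width=PULSE_WIDTH):
    """Probability of switching within the pulse, assuming P = 1 - exp(-t_P / t_set)."""
    return -math.expm1(-pulse_width / t_set(voltage))


if __name__ == "__main__":
    for millivolts in range(500, 701, 25):
        v = millivolts / 1000.0
        print(f"V = {v:.3f} V  t_set = {t_set(v):.3e} s  "
              f"P_switch = {switching_probability(v):.4f}")
```

With these assumed numbers, the switching probability moves from nearly 0 to nearly 1 over a few tens of millivolts, so a fixed-amplitude rectangular pulse gives balanced bits only when its voltage sits inside that window.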
The RRAM device variability sources discussed for TRNG can in principle be
adopted for PUF systems, thus enabling a small area, low power consumption,
and high PUF performance in terms of uniqueness and reliability. For instance,
the stochastic resistance variation in RRAM was proposed for a reconfigurable PUF
[86]. Figure 4.24a shows the calculated lognormal distributions of RRAM resis-
tance for LRS and HRS. Figure 4.24b is a sketch of a PUF circuit consisting of
an RRAM array where each cell represents a single bit and can be initialized in
either LRS or HRS. The challenge consists of the addresses of two n-bit data words, while
the response is the bit-wise comparison of the RRAM resistances of the two words. In
Fig. 4.24 a Simulated resistance distributions for LRS and HRS, following normal and lognormal
distributions, respectively. b Schematic illustration of a PUF implementation exploiting RRAM
resistance variability. Reprinted with permission from [86]. Copyright (2014) IEEE
this PUF concept, the stochastic switching allows for the reconfiguration of the PUF
by reprogramming the RRAM array, in stark contrast with systems based on fixed
manufacturing variations. PUF reconfigurability significantly enhances security protocols
based on authentication [87], since it overcomes the limitations due
to device degradation or a small CRP set. Figure 4.25 shows the characterization of the
PUF against three of the performance parameters in Sect. 4.2, namely, unpredictability,
unclonability, and reliability. First, the unpredictability of the PUF response can
be measured by studying the output bit uniformity. Figure 4.25a shows
“1” bias distributions of 256-bit responses, thus supporting a uniform output, also
confirmed by the almost equal probabilities of 3-bit responses in Fig. 4.25b. Second,
the unclonability requires that the physical (or mathematical) CRP mapping cannot
be replicated, which in turn requires a strong uniqueness of PUF to distinguish a
specific chip from another. This property can be assessed as the Hamming distance
(HD) between the responses of two different PUFs to the same challenge. It is also
referred to as the inter-chip HD (HDinter ), which should be ideally 50%. Figure 4.25c
shows the calculated HDinter for 100 PUF samples of 256 kb RRAM arrays, demonstrating
an ideal HDinter close to 50%. Finally, reliability refers to the ability of a
PUF to always give the same response to a given challenge. To evaluate the PUF
reliability, the intra-chip HD (HDintra ) can be calculated in this case among different
responses to the same challenge for the same PUF under different conditions (such
as temperature). The HDintra should be 0% for an ideal PUF, and a large separation
between HDinter and HDintra reduces false identification rate [86]. HDintra might
be affected by the dependence of RRAM resistance on temperature and voltage.
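The response generation and the two Hamming-distance metrics can be sketched as follows. This is a toy model, not the simulation of [86]: every cell resistance is drawn from a single lognormal distribution with hypothetical parameters, a 2% relative read perturbation stands in for temperature and voltage effects, and the function names are introduced here only for illustration.

```python
import random


def make_puf(n_cells, rng, mu=11.5, sigma=1.0):
    """Hypothetical RRAM array: one lognormally distributed resistance per cell."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n_cells)]


def response(resistances, addr_a, addr_b, n_bits):
    """Bit-wise comparison of the two n-bit words starting at addr_a and addr_b."""
    return [int(resistances[addr_a + k] > resistances[addr_b + k])
            for k in range(n_bits)]


def hamming_fraction(x, y):
    """Fractional Hamming distance between two equal-length bit lists."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def perturbed(resistances, rng, rel_noise):
    """Same array read under slightly different conditions (assumed Gaussian drift)."""
    return [r * (1.0 + rng.gauss(0.0, rel_noise)) for r in resistances]


if __name__ == "__main__":
    rng = random.Random(0)
    puf_a = make_puf(512, rng)
    puf_b = make_puf(512, rng)
    challenge = (0, 256, 256)  # two word addresses and the word length
    resp_a = response(puf_a, *challenge)
    resp_b = response(puf_b, *challenge)
    resp_a_again = response(perturbed(puf_a, rng, 0.02), *challenge)
    print(f"HDinter = {hamming_fraction(resp_a, resp_b):.3f}")
    print(f"HDintra = {hamming_fraction(resp_a, resp_a_again):.3f}")
```

In this model HDinter comes out near 50% because the two arrays are statistically independent, while HDintra stays small as long as the read perturbation is much smaller than the spread of the resistance distribution.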
For instance, Fig. 4.25d shows the resistance as a function of temperature for two
Fig. 4.25 a Distribution of the uniformity measured by “1” bias of a PUF implemented on a
256 kbit array. The relatively uniform output is demonstrated by the uniform occurrence of the
3-bit responses (b). c Uniqueness measured by HDinter distribution. d A resistance crossing event
between two different cells at increasing temperature, which causes a bit flipping and consequently
a reliability degradation. e Effect on HDintra distributions under different voltage fluctuations. f
HDintra distributions at different temperatures. Reprinted with permission from [86]. Copyright
(2014) IEEE
RRAM cells with two different activation energies [86]. Note the crossing between
the two resistance values at high temperature, which results in a bit flip and a consequent
reduction of the reliability. Figure 4.25e, f shows the impact of voltage
Fig. 4.26 a Schematic illustration of the resistive crosspoint array, which implements a strong PUF
by exploiting the sneak paths. b Distributions of cell current before and after the one-time program-
ming, showing quite large analog distribution. Reprinted with permission from [43]. Copyright
(2016) IEEE
Fig. 4.27 a Distributions of HDinter of 12-bit responses for 11 different input vectors. b Measured
read current for 12 columns as a function of time at T = 120 °C. c HDintra of 12-bit responses to
the same challenge as a function of time for three different temperatures T = 100, 120 and 140 °C.
Reprinted with permission from [43]. Copyright (2016) IEEE
Fig. 4.28 a Schematic of the crosspoint array enabling secure fingerprint extraction only after
provable key erasure, where the fingerprint is given by the comparison of LRS conductance between
two neighboring memristor cells. b Typical 128 × 32 fingerprint that can be generated from a 128 ×
64 memristor array. Reprinted with permission from Macmillan Publishers Ltd: Nature Electronics
[91]. Copyright (2018)
cell pairs in the array, after initializing all of them in the LRS. The random bit
identifying each pair is set to “1” if GLRS,left ≥ GLRS,right, and to “0” otherwise. Owing
to the intrinsic variability of the LRS, a random pattern (i.e., the fingerprint) is generated
to uniquely identify the device, as shown in Fig. 4.28b. Figure 4.29 shows the
experimental demonstration of provable key destruction. Here, an initial fingerprint
(FPchip, Fig. 4.29a) is generated and securely stored in a trusted database. Then, a
random key (Kchip, Fig. 4.29b) is written in the array, thus preventing the re-writing of
FPchip without losing Kchip. Kchip is also sent to the trusted party, so that it can be used
for unlocking features of the specific chip instance storing Kchip. When a key erasure
is necessary, the user simply reinitializes the array to the LRS, thereby destroying
the key Kchip and generating a new fingerprint (FP’chip, Fig. 4.29c), which constitutes
the demonstration of key erasure. The new fingerprint FP’chip is finally sent to the
trusted party for comparison with the previously stored FPchip. If the HD between
the two fingerprints is compatible with the expected distance between fingerprints of
the same chip, then the chip can be authenticated by the trusted party. In addition, the
trusted party also gets confirmation that Kchip has been erased, since erasing it is required
for generating a valid FP’chip. The practical feasibility of the described concept is
demonstrated in Fig. 4.29d, showing that the distribution of HD for the same chip is
clearly separated from the distribution of HD for different chips. Figure 4.29e shows
the same distributions for 256-bit fingerprints, where the improved separation between
the two distributions supports the need for a large number of bits in the fingerprint.
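The fingerprint extraction and the key-erasure check can be sketched with a toy model. It is not the implementation of [91]: each cell is assumed to have a device-specific mean LRS conductance plus a fresh cycle-to-cycle perturbation at every initialization, and the conductance values, the 3% perturbation and the authentication threshold of 0.25 are hypothetical. The 128 × 64 array size and the left/right comparison rule follow the text.

```python
import random

ROWS, COLS = 128, 64  # array size quoted in the caption of Fig. 4.28


def make_chip(rng):
    """Device-specific mean LRS conductance of every cell (hypothetical values, in S)."""
    return [[rng.gauss(100e-6, 10e-6) for _ in range(COLS)] for _ in range(ROWS)]


def program_lrs(chip, rng, c2c_rel=0.03):
    """Initialize all cells in the LRS; each programming adds cycle-to-cycle variation."""
    return [[g * (1.0 + rng.gauss(0.0, c2c_rel)) for g in row] for row in chip]


def fingerprint(conductances):
    """One bit per neighboring pair: 1 if G_left >= G_right, else 0."""
    return [[int(row[j] >= row[j + 1]) for j in range(0, len(row), 2)]
            for row in conductances]


def hamming_fraction(fp_a, fp_b):
    """Fractional Hamming distance between two fingerprints of equal shape."""
    flat_a = [bit for row in fp_a for bit in row]
    flat_b = [bit for row in fp_b for bit in row]
    return sum(a != b for a, b in zip(flat_a, flat_b)) / len(flat_a)


def authenticate(fp_stored, fp_new, threshold=0.25):
    """Accept the chip if the new fingerprint is close enough to the stored one."""
    return hamming_fraction(fp_stored, fp_new) < threshold


if __name__ == "__main__":
    rng = random.Random(42)
    chip, other_chip = make_chip(rng), make_chip(rng)
    fp_chip = fingerprint(program_lrs(chip, rng))      # stored by the trusted party
    # Writing the key overwrites the LRS pattern in the array.
    array_state = [[rng.randint(0, 1) for _ in range(COLS)] for _ in range(ROWS)]
    # Key erasure: reinitializing to the LRS replaces the key with analog states.
    array_state = program_lrs(chip, rng)
    fp_new = fingerprint(array_state)
    fp_other = fingerprint(program_lrs(other_chip, rng))
    print(f"same chip:      HD = {hamming_fraction(fp_chip, fp_new):.3f}")
    print(f"different chip: HD = {hamming_fraction(fp_chip, fp_other):.3f}")
    print("authenticated:", authenticate(fp_chip, fp_new))
```

In this model the same-chip distance is nonzero because of the cycle-to-cycle perturbation, so the threshold has to sit between the same-chip and different-chip distributions, in line with the separation discussed for Fig. 4.29d, e.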
Fig. 4.29 a Initial fingerprint FPchip stored by the trusted party. b Digital key Kchip written in
the memristor array. c A second fingerprint FP’chip generated by the same array, thus destroying
the key. d HD distributions of 128-bit fingerprints from same chip and different chips, showing
sufficient separation, hence demonstrating the feasibility of the scheme. e The same comparison is
given for 256-bit fingerprints. Reprinted with permission from Macmillan Publishers Ltd: Nature
Electronics [91]. Copyright (2018)
Acknowledgements This article has received funding from the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation programme (grant agreement
No. 648635).
References
1. J. Rajendran, R. Karri, J.B. Wendt, M. Potkonjak, N.R. McDonald, G.S. Rose, B.T. Wysocki,
Nanoelectronic solutions for hardware security. IACR Cryptol. ePrint Arch. 2012, 575 (2012)
2. C. Stergiou, K.E. Psannis, B.-G. Kim, B. Gupta, Secure integration of iot and cloud computing.
Futur. Gener. Comput. Syst. 78, 964–975 (2018)
3. C. Herder, M.-D. Yu, F. Koushanfar, S. Devadas, Physical unclonable functions and applica-
tions: a tutorial. Proc. IEEE 102(8), 1126–1141 (2014)
4. M.-W. Ryu, J. Kim, S.-S. Lee, M.-H. Song, Survey on internet of things. SmartCR 2(3), 195–
202 (2012)
5. K.-K.R. Choo, M.M. Kermani, R. Azarderakhsh, M. Govindarasu, Emerging embedded and
cyber physical system security challenges and innovations. IEEE Trans. Dependable Secur.
Comput. 3, 235–236 (2017)
6. F. Tehranipoor, Towards implementation of robust and low-cost security primitives for resource-
constrained iot devices (2018), arXiv:1806.05332
7. H. Nili, G.C. Adam, B. Hoskins, M. Prezioso, J. Kim, M.R. Mahmoodi, F.M. Bayat, O. Kavehei,
D.B. Strukov, Hardware-intrinsic security primitives enabled by analogue state and nonlinear
conductance variations in integrated memristors. Nat. Electron. 1(3), 197 (2018)
8. S.K. Mathew, S. Srinivasan, M.A. Anders, H. Kaul, S.K. Hsu, F. Sheikh, A. Agarwal, S.
Satpathy, R.K. Krishnamurthy, 2.4 Gbps, 7 mw all-digital PVT-variation tolerant true random
number generator for 45 nm CMOS high-performance microprocessors. IEEE J. Solid-State
Circuits 47(11), 2807–2821 (2012)
9. J. Katz, A.J. Menezes, P.C. Van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography
(CRC Press, 1996)
10. D. Ielmini, H.-S.P. Wong, In-memory computing with resistive switching devices. Nat. Elec-
tron. 1(6), 333 (2018)
11. J.J. Yang, D.B. Strukov, D.R. Stewart, Memristive devices for computing. Nat. Nanotechnol.
8(1), 13 (2013)
12. C.-H. Chang, Y. Zheng, L. Zhang, A retrospective and a look forward: fifteen years of physical
unclonable function advancement. IEEE Circuits Syst. Mag. 17(3), 32–62 (2017)
13. G.S. Rose, Security meets nanoelectronics for internet of things applications, in Proceedings
of the 26th Edition on Great Lakes Symposium on VLSI (ACM, 2016), pp. 181–183
14. S. Ghosh, Spintronics and security: prospects, vulnerabilities, attack models, and preventions.
Proc. IEEE 104(10), 1864–1893 (2016)
15. A. Alaghi, J.P. Hayes, Survey of stochastic computing. ACM Trans. Embed. Comput. Syst.
(TECS) 12(2s), 92 (2013)
16. J.S. Friedman, L.E. Calvet, P. Bessière, J. Droulez, D. Querlioz, Bayesian inference with Müller
C-elements. IEEE Trans. Circuits Syst. I: Regul. Pap. 63(6), 895–904 (2016)
17. W. Maass, Noise as a resource for computation and learning in networks of spiking neurons.
Proc. IEEE 102(5), 860–880 (2014)
18. P.A. Merolla, J.V. Arthur, R.Alvarez-Icaza, A.S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson,
N. Imam, C. Guo, Y. Nakamura et al., A million spiking-neuron integrated circuit with a scalable
communication network and interface. Science 345(6197), 668–673 (2014)
19. G. Pedretti, V. Milo, S. Ambrogio, R. Carboni, S. Bianchi, A. Calderoni, N. Ramaswamy, A.S.
Spinelli, D. Ielmini, Stochastic learning in neuromorphic hardware via spike timing dependent
plasticity with rram synapses. IEEE J. Emerg. Sel. Top. Circuits Syst. 8(1), 77–85 (2018)
20. G. Alvarez, S. Li, Some basic cryptographic requirements for chaos-based cryptosystems. Int.
J. Bifurc. Chaos 16(08), 2129–2151 (2006)
21. Maxim Integrated, Pseudo random number generation using linear feedback shift registers
(2010), Retrieved from Maxim Integrated website: http://www.maximintegrated.com/an4400
22. J. Von Neumann, Various techniques used in connection with random digits. Appl. Math. Ser.
12(36–38), 5 (1951)
23. J. Kelsey, B. Schneier, D. Wagner, C. Hall, Cryptanalytic attacks on pseudorandom number
generators, in International Workshop on Fast Software Encryption (Springer, 1998), pp. 168–
188
24. S. Chari, C. Jutla, J.R. Rao, P. Rohatgi, A cautionary note regarding
evaluation of AES candidates on smart-cards, in Second Advanced Encryption Standard
Candidate Conference (Citeseer, 1999), pp. 133–147
25. N. Gisin, G. Ribordy, W. Tittel, H. Zbinden, Quantum cryptography. Rev. Mod. Phys. 74(1),
145 (2002)
26. B. Jun, P. Kocher, The Intel random number generator. Cryptogr. Res. Inc. White Pap. 27, 1–8
(1999)
27. S. Sahay, M. Suri, Recent trends in hardware security exploiting hybrid cmos-resistive memory
circuits. Semicond. Sci. Technol. 32(12), 123001 (2017)
28. R. Brederlow, R. Prakash, C. Paulus, R. Thewes, A low-power true random number generator
using random telegraph noise of single oxide-traps, in IEEE International Solid-State Circuits
Conference, 2006. ISSCC 2006. Digest of Technical Papers (IEEE, 2006), pp. 1666–1675
29. C.-Y. Huang, W.C. Shen, Y.-H. Tseng, Y.-C. King, C.-J. Lin, A contact-resistive-random-
access-memory-based true-random-number generator. IEEE Electron Device Lett. 33(8), 1108
(2012)
30. A. Fukushima, T. Seki, K. Yakushiji, H. Kubota, H. Imamura, S. Yuasa, K. Ando, Spin dice:
a scalable truly random number generator based on spintronics. Appl. Phys. Express 7(8),
083001 (2014)
31. S. Chun, S.-B. Lee, M. Hara, W. Park, S.-J. Kim, High-density physical random number
generator using spin signals in multidomain ferromagnetic layer. Adv. Condens. Matter Phys.
(2015)
from the integration and materials property points of view. Adv. Funct. Mater. 24(34), 5316–
5339 (2014)
54. A. Bricalli, E. Ambrosi, M. Laudato, M. Maestro, R. Rodriguez, D. Ielmini. SiOx -based resis-
tive switching memory (RRAM) for crossbar storage/select elements with high on/off ratio, in
2016 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2016), pp. 4.3.1–4.3.4
55. D. Ielmini, Modeling the universal set/reset characteristics of bipolar RRAM by field-and
temperature-driven filament growth. IEEE Trans. Electron Devices 58(12), 4309–4317 (2011)
56. S. Larentis, F. Nardi, S. Balatti, D.C. Gilmer, D. Ielmini, Resistive switching by voltage-driven
ion migration in bipolar RRAM—part ii: modeling. IEEE Trans. Electron Devices 59(9), 2468–
2475 (2012)
57. S. Ambrogio, S. Balatti, A. Cubeta, A. Calderoni, N. Ramaswamy, D. Ielmini, Statistical fluc-
tuations in HfOx resistive-switching memory: part i-set/reset variability. IEEE Trans. Electron
Devices 61(8), 2912–2919 (2014)
58. S. Ambrogio, S. Balatti, A. Cubeta, A. Calderoni, N. Ramaswamy, D. Ielmini, Statistical
fluctuations in HfOx resistive-switching memory: part ii–random telegraph noise. IEEE Trans.
Electron Devices 61(8), 2920–2927 (2014)
59. S. Ambrogio, S. Balatti, V. McCaffrey, D.C. Wang, D. Ielmini, Noise-induced resistance broad-
ening in resistive switching memory—part i: intrinsic cell behavior. IEEE Trans. Electron
Devices 62(11), 3805–3811 (2015)
60. S. Ambrogio, S. Balatti, V. McCaffrey, D.C. Wang, D. Ielmini, Noise-induced resistance broad-
ening in resistive switching memory—part ii: array statistics. IEEE Trans. Electron Devices
62(11), 3812–3819 (2015)
61. D. Ielmini, F. Nardi, C. Cagli, Resistance-dependent amplitude of random telegraph-signal
noise in resistive switching memories. Appl. Phys. Lett. 96(5), 053503 (2010)
62. Y. Yoshimoto, Y. Katoh, S. Ogasahara, Z. Wei, K. Kouno, A ReRAM-based physically unclonable
function with bit error rate < 0.5% after 10 years at 125 °C for 40 nm embedded application,
in 2016 IEEE Symposium on VLSI Technology (IEEE, 2016), pp. 1–2
63. STS NIST, Special publication 800-22. A statistical test suite for random and pseudorandom
number generators for cryptographic applications (2010)
64. C. Cagli, F. Nardi, D. Ielmini, Modeling of set/reset operations in NiO-based resistive-switching
memory devices. IEEE Trans. Electron Devices 56(8), 1712–1720 (2009)
65. S.H. Jo, T. Chang, K.-H. Kim, S. Gaba, W. Lu, Experimental, modeling and simulation studies
of nanoscale resistance switching devices, in 9th IEEE Conference on Nanotechnology, 2009.
IEEE-NANO 2009 (IEEE, 2009), pp. 493–495
66. H. Jiang, D. Belkin, S.E. Savel’ev, S. Lin, Z. Wang, Y. Li, S. Joshi, R. Midya, C. Li, M. Rao
et al., A novel true random number generator based on a stochastic diffusive memristor. Nat.
Commun. 8(1), 882 (2017)
67. S. Gaba, P. Knag, Z. Zhang, W. Lu, Memristive devices for stochastic computing, in 2014 IEEE
International Symposium on Circuits and Systems (ISCAS) (IEEE, 2014), pp. 2592–2595
68. S. Gaba, P. Sheridan, J. Zhou, S. Choi, L. Wei, Stochastic memristive devices for computing
and neuromorphic applications. Nanoscale 5(13), 5872–5878 (2013)
69. S.H. Jo, K.-H. Kim, W. Lu, Programmable resistance switching in nanoscale two-terminal
devices. Nano Lett. 9(1), 496–500 (2008)
70. T. Ohno, T. Hasegawa, T. Tsuruoka, K. Terabe, J.K. Gimzewski, M. Aono, Short-term plasticity
and long-term potentiation mimicked in single inorganic synapses. Nat. Mater. 10(8), 591
(2011)
71. Z. Wang, S. Joshi, S.E. Savel’ev, H. Jiang, R. Midya, P. Lin, M. Hu, N. Ge, J.P. Strachan, Z. Li
et al., Memristors with diffusive dynamics as synaptic emulators for neuromorphic computing.
Nat. Mater. 16(1), 101 (2017)
72. A. Bricalli, E. Ambrosi, M. Laudato, M. Maestro, R. Rodriguez, D. Ielmini, Resistive switching
device technology based on silicon oxide for improved on-off ratio–part ii: select devices. IEEE
Trans. Electron Devices 65(1), 122–128 (2018)
73. R. Midya, Z. Wang, J. Zhang, S.E. Savel’ev, C. Li, M. Rao, M.H. Jang, S. Joshi, H. Jiang, P.
Lin et al., Anatomy of Ag/hafnia-based selectors with 10^10 nonlinearity. Adv. Mater. 29(12),
1604457 (2017)
74. S. Ambrogio, S. Balatti, S. Choi, D. Ielmini, Impact of the mechanical stress on switching
characteristics of electrochemical resistive memory. Adv. Mater. 26(23), 3885–3892 (2014)
75. R. Carboni, W. Chen, M. Siddik, J. Harms, A. Lyle, W. Kula, G. Sandhu, D. Ielmini, Random
number generation by differential read of stochastic switching in spin-transfer torque memory.
IEEE Electron Device Lett. (2018)
76. W.H. Choi, Y. Lv, J. Kim, A. Deshpande, G. Kang, J.-P. Wang, C.H. Kim, A magnetic tunnel
junction based true random number generator with conditional perturb and real-time output
probability tracking. in 2014 IEEE International Electron Devices Meeting (IEDM) (IEEE,
2014), pp. 12.5.1–12.5.4
77. A. Fantini, L. Goux, R. Degraeve, D.J. Wouters, N. Raghavan, G. Kar, A. Belmonte, Y.-Y.
Chen, B. Govoreanu, M. Jurczak, Intrinsic switching variability in HfO2 RRAM, in 2013 5th
IEEE International Memory Workshop (IMW) (IEEE, 2013), pp. 30–33
78. R. Carboni, S. Ambrogio, W. Chen, M. Siddik, J. Harms, A. Lyle, W. Kula, G. Sandhu, D.
Ielmini, Understanding cycling endurance in perpendicular spin-transfer torque (p-STT) mag-
netic memory, in 2016 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2016),
pp. 21.6.1–21.6.4
79. J.J. Nowak, R.P. Robertazzi, J.Z. Sun, G. Hu, J.-H. Park, J.H. Lee, A.J. Annunziata, G.P. Lauer,
R. Kothandaraman, E.J. O’Sullivan et al., Dependence of voltage and size on write error rates
in spin-transfer torque magnetic random-access memory. IEEE Magn. Lett. 7, 1–4 (2016)
80. D. Apalkov, B. Dieny, J.M. Slaughter, Magnetoresistive random access memory. Proc. IEEE
104(10), 1796–1830 (2016)
81. A.F. Vincent, N. Locatelli, J.-O. Klein, W.S. Zhao, S. Galdin-Retailleau, D. Querlioz, Analytical
macrospin modeling of the stochastic switching time of spin-transfer torque devices. IEEE
Trans. Electron Devices 62(1), 164–170 (2015)
82. Z. Li, S. Zhang, Thermally assisted magnetization reversal in the presence of a spin-transfer
torque. Phys. Rev. B 69(13), 134416 (2004)
83. D. Vodenicarevic, N. Locatelli, A. Mizrahi, J.S. Friedman, A.F. Vincent, M. Romera, A.
Fukushima, K. Yakushiji, H. Kubota, S. Yuasa et al., Low-energy truly random number genera-
tion with superparamagnetic tunnel junctions for unconventional computing. Phys. Rev. Appl.
8(5), 054045 (2017)
84. A. Mizrahi, N. Locatelli, R. Lebrun, V. Cros, A. Fukushima, H. Kubota, S. Yuasa, D. Quer-
lioz, J. Grollier, Controlling the phase locking of stochastic magnetic bits for ultra-low power
computation. Sci. Rep. 6, 30535 (2016)
85. R. Heindl, W.H. Rippard, S.E. Russek, M.R. Pufall, A.B. Kos, Validity of the thermal activation
model for spin-transfer torque switching in magnetic tunnel junctions. J. Appl. Phys. 109(7),
073910 (2011)
86. A. Chen, Utilizing the variability of resistive random access memory to implement reconfig-
urable physical unclonable functions. IEEE Electron Device Lett. 36(2), 138–140 (2015)
87. K. Kursawe, A.-R. Sadeghi, D. Schellekens, B. Skoric, P. Tuyls, Reconfigurable physical
unclonable functions-enabling technology for tamper-resistant storage (2009)
88. J. Zhou, K.-H. Kim, L. Wei, Crossbar rram arrays: selector device requirements during read
operation. IEEE Trans. Electron Devices 61(5), 1369–1376 (2014)
89. Y.Y. Chen, M. Komura, R. Degraeve, B. Govoreanu, L. Goux, A. Fantini, N. Raghavan, S.
Clima, L. Zhang, A. Belmonte, A. Redolfi, G.S. Kar, G. Groeseneken, D.J. Wouters, M. Jurczak,
Improvement of data retention in HfO2 /Hf 1T1R RRAM cell under low operating current
90. Y. Xie, A. Srivastava, Mitigating sat attack on logic locking, in Cryptographic Hardware and
Embedded Systems – CHES 2016, ed. by B. Gierlichs, A.Y. Poschmann (Springer, Berlin,
2016), pp. 127–146
91. H. Jiang, C. Li, R. Zhang, P. Yan, P. Lin, Y. Li, J.J. Yang, D. Holcomb, Q. Xia, A provable key
destruction scheme based on memristive crossbar arrays. Nat. Electron. 1(10), 548–554 (2018)
Chapter 5
Memristive Biosensors for Ultrasensitive
Diagnostics and Therapeutics
Medical devices still face several limitations in the rapid, reliable, and ultrasensitive
sensing of biomarkers from small volumes of clinical samples. Cancer diagnosis in particular
usually involves uncomfortable medical tests and long waiting times for the results, and it
may still yield an uncertain outcome. Another very important aspect is diagnosis at the
early stages of the disease, when a suitable therapy can be chosen and treatment has a
higher probability of success. However, diagnostic tools still lack the resolution needed
to detect biomarkers at the early stages of the disease. Moreover, clinical practice still
lacks analytical methods for efficient, ultrasensitive monitoring of therapeutic compounds.
Reliable, low-cost, and accessible monitoring systems for therapeutic compounds are a very
important aspect of individualized health care, especially in the treatment of severe
diseases such as cancer and AIDS. These requirements are even more pressing for drugs with
a very narrow therapeutic window that lies at low concentrations. Moreover, different
patients may respond differently to the very same dose of a drug, so that the therapeutic
response differs from what is expected. Therefore, novel ultrasensitive nanobiosensors for
the direct and label-free detection of chemical and biological species, offering high
reliability, robustness, and quick data acquisition, may achieve optimum sensing output in
both diagnostics and therapeutics, opening the way to early diagnosis and to treatment with
higher efficacy and fewer side effects for patients.
Nanostructure-based sensors are considered a highly promising strategy to address the
issues of sensitivity and limit of detection in both diagnostics and therapeutics. They may
also allow the sensors to be integrated in portable devices, together with microfluidics
and electronics, for robust, flexible, and automated clinical applications. Silicon (Si)
nanowires, with unique properties such as a high surface-to-volume ratio and a size
comparable to that of biomolecules, combined with the specificity of immune-sensing
techniques, may provide optimum biosensing platforms [1].
In addition, although the memristive effect has already been introduced in many different
applications, very few implementations are dedicated to bio-detection. Carrara et al. [2]
demonstrated for the first time the potential use of memristive effects in nanostructured
devices for biosensing applications. The scope of memristive phenomena is thus extended by
coupling nanofabricated devices that exhibit memristive behavior with biological processes,
bringing novelty and new solutions to the biosensing field.
Nanowire arrays are emerging as promising building blocks for miniaturized bioassays. In
memristive nanowires, biosensing is based on the variations of a voltage difference that
appears in the semi-logarithmic current-to-voltage characteristics when charged substances
are introduced on the surface. The memristive nanowires are fabricated from commercially
available Silicon-on-Insulator wafers, and the nanofabrication can be summarized in two
electron-beam (e-beam) lithography masks. The first e-beam lithography mask defines the
nanodevice electrodes, which are created through Nickel (Ni) evaporation, liftoff, and
annealing. The second e-beam lithography step patterns the nanowires, and as a last step
the nanowire structures are etched through repeated Bosch-process etching cycles of the
upper Si. Overall, this process results in suspended, vertically stacked, two-terminal,
Schottky-barrier Si nanowire arrays anchored between two nickel silicide (NiSi) pillars
(Fig. 5.1), for devices designed with a smaller geometry, i.e., a length of 420 nm and a
width of 35 nm, and with a larger geometry, i.e., a length of 980 nm and a width of 90 nm
(inset Fig. 5.1). The particular electrical response of these memristive nanodevices
provides a label-free, ultrasensitive bio-detection method. More specifically, the
nanodevices are characterized electrically by double sweeping the source-to-drain voltage
(Vds) at a fixed back-gate potential of 0 V. A distinctive feature of the electrical
response of these nanowires is the recorded hysteresis loop, which is characteristic of a
memristive system (Fig. 5.2 top). In these nanodevices, the memory effect can be attributed
to the rearrangement of the charge carriers at the nanoscale due to external perturbations
[4], such as an applied voltage bias. For most bare nanowire devices, this hysteresis
appears fully pinched at zero voltage. In some other cases, the hysteresis is pinched at
voltages very close to, but not exactly, zero because of environmental conditions, such as
the ambient humidity, which perturbs the conductivity of the device and strongly affects
the memristive signals.
Fig. 5.1 SEM top and tilted view of the vertically stacked nanowire structures bridging NiSi source
and drain contacts (Reproduced with permission from [3]. Copyright 2016 American Chemical
Society)
Typically, the hysteresis in the memristive electrical characteristics is modified after
surface treatment of the nanodevice. The charged nature of the biological molecules has an
effect on the nanodevice similar to the one produced
136 I. Tzouvadaki et al.
[Fig. 5.2 plots: current I [A] and Log|I| [A] versus Input Voltage [V] (−2 to 2 V), top and bottom rows]
Fig. 5.2 Experimental electrical response obtained for bare two-terminal Schottky-barrier Si
nanowire arrays exhibiting memristive characteristics (memristive devices)—top—and experimen-
tal electrical response after surface treatment (memristive biosensors)—bottom. The pinched hys-
teresis and memristive characteristics are lost giving rise to a voltage gap in the semi-logarithmic
electrical characteristics
without any bio-functionalization but in the presence of an inorganic all-around gate [2].
The net charge of the biomolecules changes the initial hysteresis and creates a sort of
voltage memory. It appears as a voltage difference, the so-called voltage gap, between the
positions of the current minima in the forward and backward branches, as a further memory
effect of the voltage scan across the nanowire (Fig. 5.2 bottom) [2, 5]. More specifically,
this voltage gap depends on the kind and the concentration of the charged substances
introduced on the device surface, and it is very sensitive to the interplay of the charges.
Overall, the memristive devices are bio-functionalized with receptor molecules to obtain
memristive biosensors and are then exposed to the target molecules. Ultrasensitive sensing
is provided by the variations of this voltage gap, which constitutes the main bio-detection
parameter.
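As a minimal sketch of how the voltage gap described above could be extracted from a
double-sweep measurement, the following Python snippet locates the current minimum in the
forward and in the backward branch and takes the difference of the two voltage positions.
The function name `voltage_gap`, the synthetic data, and the assumption that the first
half of the record is the forward sweep and the second half the backward sweep are
illustrative choices, not part of the original measurements.

```python
def voltage_gap(v, i):
    """Voltage difference between the current minima of the forward and
    backward branches of a double-sweep I-V record.

    Assumption (hypothetical data layout): the first half of the lists is
    the forward sweep, the second half is the backward sweep.
    """
    half = len(v) // 2
    # Index of the smallest |I| in each branch
    k_fwd = min(range(0, half), key=lambda k: abs(i[k]))
    k_bwd = min(range(half, len(v)), key=lambda k: abs(i[k]))
    return abs(v[k_fwd] - v[k_bwd])


# Synthetic example: minima placed at -0.2 V (forward) and +0.2 V (backward)
n = 401
v_forward = [-2.0 + 4.0 * k / (n - 1) for k in range(n)]
v_backward = v_forward[::-1]
i_forward = [1e-9 * abs(x + 0.2) + 1e-12 for x in v_forward]
i_backward = [1e-9 * abs(x - 0.2) + 1e-12 for x in v_backward]

v = v_forward + v_backward
i = i_forward + i_backward
gap = voltage_gap(v, i)
print(f"voltage gap = {gap:.2f} V")
```

On real data the minima are noisy, so some smoothing of Log|I| before the search would
probably be needed; that step is left out here to keep the sketch short.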
Fig. 5.3 (i) AFM morphological analysis of nanowire arrays before (a) and after (b) the
bio-modification with an anti-Prostate Specific Antigen antibody. After bio-functionalization, a
clear change in the morphology can be seen, and an agglomeration of biomolecules can be observed
on the surface of the nanodevices, masking the initial shape of the nanowires. (ii) Confocal
microscopy of nanofabricated structures before and after bio-functionalization: 3D fluorescence
signal distribution acquired using CLSM (wire arrays of width 150 nm and length 4.8 µm) before
(control sample) (a) and after (b) the bio-modification with FITC-conjugated antibodies. The bright
regions in the right image correspond to the biomolecules accumulated in the sample (Reference
[6]-Reproduced by permission of The Royal Society of Chemistry)
The modification of the related electrical-conductivity hysteresis and the voltage gap
variations due to the presence of charged macromolecules were investigated and fully
characterized through the layer-by-layer deposition of charged polymeric films, i.e., via
the implementation of a polyelectrolyte (PE) multilayer. PEs are linear macromolecular
chains that bear a large number of charged groups when dissolved in a suitable polar
solvent. Among them, PSS (poly(sodium 4-styrene sulfonate)) is a strong polyelectrolyte
that is negatively charged over a wide pH range, while PAH (poly(allylamine hydrochloride))
is a weak polyelectrolyte that is positively charged in neutral or acidic solution [7].
Subsequent depositions finally result in a PE multilayer stabilized by strong electrostatic
forces [8].
The formation of PE multilayers is based on the consecutive adsorption of polyions with
alternating charge using the layer-by-layer (LBL) technique, as described by Chen and
McCarthy [9]. Here, the PE multilayer is formed by alternately adsorbing positively charged
PAH and negatively charged PSS from prepared PE solutions. The acquired electrical
characteristics (Fig. 5.4) give the average voltage gap after the deposition of each PE
layer for two different concentrations. The first electrical measurements were performed
after -OH treatment, which leads to the appearance of the voltage gap. Afterward, the first
PAH adsorption narrows the voltage gap (by 0.09 V for the 200 nM concentration and by
0.16 V for the 50 µM concentration of PE). This change results from the change in the
charge density at the surface of the device due to the positively charged PAH, an effect
that is even more pronounced at the higher concentration of PAH: as more positive charges
are present on the surface, a larger change of the voltage gap is registered.
[Fig. 5.4 schematic and plot: -OH groups followed by alternating PAH and PSS layers; voltage gap axis from 0.00 to 0.20 versus Polyelectrolyte Layers (1 to 7)]
Fig. 5.4 Formation of a multilayer of PEs by repeated electrostatic adsorption of oppositely charged
PE layers; Average voltage gap value obtained from electrical characterization of devices treated
with Layer-by-Layer deposition of PEs for 200 nM (red points) and 50 µM (black points) (Repro-
duced with permission from [3]. Copyright 2016 American Chemical Society)
The adsorption of the negatively charged PSS shifts the average voltage gap back to a
higher value (0.17 V), and further treatment with PAH again decreases the average voltage
gap (0.15 V for the 200 nM concentration of PE). It is thus demonstrated that further
alternating exchange of the PE solution causes an alternating output signal that slowly
decreases in amplitude. On the other hand, the consecutive adsorption of the same type of
PE (successive adsorption of PSS is presented in Fig. 5.4), tested with the highest
concentration of PE, results in a one-directional trend for the voltage gap, which
increases from 0.05 to 0.21 V. Given these interesting characteristics, the fabricated
memristive nanostructures are then applied in the biosensing field, enabling the detection
of femtomolar and even attomolar concentrations. The interplay of the charges
(positive/negative) brought by the receptor and target molecules and the concentration of
the reagents (increasing/decreasing) define the width of the voltage gap, which leads to an
ultrasensitive bio-detection method with immense potential for novel biosensing.
[Figure schematic: Current [A] versus Voltage [V], annotated with bio-functionalization with antibodies, antigen uptake, and further antigen uptake]
correct physiological conditions (pH 7.4), arginine and lysine residues are positively
charged, while aspartic and glutamic acid residues are negatively charged. In an antibody,
positively charged residues are in excess with respect to negatively charged ones, even if
the charge distribution is quite similar [2]. In contrast, antigens like PSA are negatively
charged; therefore, when antibody–antigen binding occurs, an excess of negative charge
accumulates at the nanowire surface, and it grows with increasing antigen concentration as
the uptake of target molecules progresses. Aptamers, in turn, are single-stranded RNA or
DNA oligonucleotides and are therefore considered negatively charged. Hence, for pairs of
an aptamer and a negatively charged antigen or drug, a one-way, increasing trend of the
voltage gap is expected with increasing antigen or drug concentration.
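The sign argument above can be illustrated with a crude bookkeeping of charged residues.
The sketch below counts lysine (K) and arginine (R) as +1 and aspartic acid (D) and
glutamic acid (E) as -1 at pH 7.4. It ignores histidine, the chain termini, and any
pKa-dependent partial charges, and the two sequences are invented for illustration only;
they are not real antibody or PSA sequences.

```python
# Crude sign bookkeeping at physiological pH (7.4):
# K and R count as +1, D and E count as -1, all other residues as 0.
CHARGE_AT_PH_7_4 = {"K": +1, "R": +1, "D": -1, "E": -1}


def net_charge(sequence):
    """Sum of the nominal residue charges of a one-letter amino acid sequence."""
    return sum(CHARGE_AT_PH_7_4.get(res, 0) for res in sequence.upper())


# Hypothetical fragments (illustrative only)
receptor_like = "MKRKLAGRKDE"   # excess of positive residues
target_like = "MDEEDLAGKDE"     # excess of negative residues

q_receptor = net_charge(receptor_like)
q_target = net_charge(target_like)

# Net surface charge per receptor after binding n target molecules
for n_bound in range(3):
    print(n_bound, q_receptor + n_bound * q_target)
```

With these made-up fragments, the surface charge starts positive and becomes increasingly
negative as more targets bind, which is the qualitative trend the text relies on.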
The Debye screening length between the sensor surface and the analyte determines the extent
of a space charge region near a discontinuity, and it is commonly introduced when
measurements are performed in liquid. The surface charges of biomolecules in a buffer
solution are shielded by oppositely charged buffer ions in the solution, the so-called
counterions. Therefore, Debye screening may potentially mask the sensing outcomes in some
cases, for instance, for extremely low sample
[Fig. 5.6 plot annotations: saturation; high humidity impact]
Fig. 5.6 Memristive device rH% Calibration: Average voltage gap value exhibited by non-bio-
modified nanostructures just after the fabrication process tested under different relative humidity
conditions (rH%)
Fig. 5.7 Voltage gap dependence upon the bio-functionalization reagent: anti-PSA DNA aptamers
(≈15 kDa), anti-PSA ScAb (≈42 kDa), and full-size anti-PSA IgG antibody (≈150 kDa) demon-
strate different sizes, and therefore correspond to different voltage gap values resulting in a linear
trend for the voltage gap–reagent size relation [12]
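To give a feeling for the length scale of the Debye screening mentioned earlier, the
following sketch evaluates the standard textbook expression for a symmetric monovalent
electrolyte, lambda_D = sqrt(eps_r * eps_0 * k_B * T / (2 * N_A * e^2 * I)). The formula is
not given in the chapter; the relative permittivity of water, the temperature, and the two
ionic strengths are assumed values chosen for illustration.

```python
import math

# Physical constants (SI units)
EPS_0 = 8.8541878128e-12   # vacuum permittivity, F/m
K_B = 1.380649e-23         # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19 # elementary charge, C
N_A = 6.02214076e23        # Avogadro constant, 1/mol


def debye_length_nm(ionic_strength_molar, eps_r=78.4, temperature_k=298.15):
    """Debye length in nm for a 1:1 electrolyte of given ionic strength (mol/L).

    eps_r and temperature_k are assumed values for water at about 25 degrees C.
    """
    ionic_strength_si = ionic_strength_molar * 1e3  # mol/L -> mol/m^3
    numerator = eps_r * EPS_0 * K_B * temperature_k
    denominator = 2.0 * N_A * E_CHARGE**2 * ionic_strength_si
    return math.sqrt(numerator / denominator) * 1e9


# Assumed example buffers: physiological strength and a 100-fold dilution
for strength in (0.15, 0.0015):
    print(f"I = {strength} M -> Debye length = {debye_length_nm(strength):.2f} nm")
```

The Debye length scales with the inverse square root of the ionic strength, so a 100-fold
dilution of the buffer lengthens it tenfold.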
Due to the wide range of applications that memristor devices may offer, several efforts
have been made to study memristive behavior experimentally as well as computationally
[13–20]. Besides theoretical and experimental studies, models that approximate the physical
realization well are needed. In this framework, it is worth mentioning the simple compact
model for the electrical behavior of memristors introduced by Biolek et al. [21], a
mathematical SPICE model of the memristor prototype manufactured in 2008 at Hewlett-Packard
(HP) Labs [13]. Furthermore, Benderli [22] suggested a macromodel that simulates the
electrical behavior of thin-film titanium dioxide (TiO2) memristors. Last but not least,
Rak et al. [23] created a memristor element in SPICE that simulates the memristor
realization published by HP Labs and can be used as a circuit element in design work. For
the computational study of the memristive biosensors, a macromodel of a memristor element
was created and combined with analog circuit elements to form equivalent circuit models
that successfully reproduced the behavior of the physical system, fitting the experimental
results of memristive biosensors to a good approximation [24]. Through simulations and
adequate fitting between the experimental and computational outcomes, it was found that the
electrical characteristics obtained from experimental measurements exhibit hysteretic
properties attributable to memristive devices. This validates the hypothesis that the
experimental setup deals with memristive behavior and confirms the memristive nature of the
physical system. In addition, the voltage gap appearing in the current-to-voltage
characteristics of nanowires with a bio-modified surface was successfully reproduced
computationally and was related to capacitive effects due to minority carriers in the
nanowire [24]. It was also indicated that those effects are strongly affected by the
concentration of biomolecules taken up on the device surface.
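For orientation, the pinched hysteresis that such memristor models reproduce can be
sketched with the linear ion drift description of the HP device [13]: the memristance is
M(x) = R_on * x + R_off * (1 - x) and the state evolves as dx/dt = (mu_v * R_on / D^2) * i.
The snippet below is a simplified stand-in, not the SPICE macromodel of [21–24]: it uses
forward-Euler integration, clips the state to [0, 1] instead of applying a window function,
and all parameter values are assumed for illustration.

```python
import math


def simulate_memristor(v_amp=1.0, freq=1.0, r_on=100.0, r_off=16e3,
                       d=10e-9, mu_v=1e-14, x0=0.1, steps=12000):
    """Forward-Euler simulation of a linear ion drift memristor over one
    period of a sinusoidal voltage. All parameter values are illustrative."""
    dt = 1.0 / (freq * steps)
    k = mu_v * r_on / d**2          # state-equation prefactor
    x = x0                          # normalized width of the doped region
    v_trace, i_trace = [], []
    for n in range(steps + 1):
        v = v_amp * math.sin(2.0 * math.pi * freq * n * dt)
        m = r_on * x + r_off * (1.0 - x)   # instantaneous memristance
        i = v / m
        v_trace.append(v)
        i_trace.append(i)
        # Euler update; clipping replaces the window function of refined models
        x = min(1.0, max(0.0, x + k * i * dt))
    return v_trace, i_trace


v_trace, i_trace = simulate_memristor()

# Same voltage (0.5 V) on the rising and on the falling part of the sweep
rising, falling = 1000, 5000
print(f"V = {v_trace[rising]:.3f} V: I_rising = {i_trace[rising]:.3e} A, "
      f"I_falling = {i_trace[falling]:.3e} A")
```

The current vanishes whenever the voltage is zero, while the two currents at the same
non-zero voltage differ; together these give the loop pinched at the origin.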
[Fig. 5.8 panels: circuit schematic with source Vin and diodes D; plot of Log|I| [A]]
Fig. 5.8 Equivalent circuit of a memristor sandwiched between two non-identical head-to-head
Schottky barriers. The sub-circuits, each consisting of a diode in parallel with a resistor, emulate
the effect of the Schottky barriers. The circuit consists of resistances in the range of 0.5–1 kΩ
and common Si epitaxial planar fast switching diodes provided by SPICE (a). Semi-logarithmic
current-to-voltage results obtained from the equivalent circuit compared with experimental results
from electrical measurements of a bare memristive device (b). The simulation current is scaled to
the experimental current range, and the input voltage range is [−3:3] V. Since the devices by
nature usually behave non-identically, multiple experimental curves are presented. The
computationally obtained results follow the average behavior of the physical system to a good
approximation, and it can be concluded that the experimental setup exhibits memristive behavior
(elaborated by [24])
would be exchanged with the forward-biased one and vice versa. For consistency, the input
values of the sinusoidal voltage source (Vin) were the same as in the case of the pure
memristor equivalent circuit. The experimental semi-logarithmic current-to-voltage
characteristics show a noticeable asymmetry between the branches. Under ideal
circumstances, both branches of the semi-logarithmic current-to-voltage curve would be
symmetrical, since the Schottky barriers of the device structure are considered to be
identical. Nevertheless, the data measured under real experimental conditions indicate
non-identical branches for the majority of the devices under study. This slight asymmetry
of the branches may be explained as a consequence of the non-identical contact areas
occurring under real conditions, mainly due to the presence of different interfacial
insulating layers at the two electrode contacts. In order to emulate this asymmetry of the
physical system, and considering that one of the diodes does not conduct during one half of
the voltage cycle, as mentioned before, the equivalent circuit in this specific case could
be simplified by replacing one of the two autonomous sub-circuits (RD) with a resistor. The
non-identical Schottky barriers are thus taken into account in the equivalent electrical
circuit through the equivalent resistor. During one half of the current cycle (depending on
the polarity of the bias voltage), one diode does not conduct because it is reverse biased,
and consequently only the remaining resistance originating from the reverse-biased Schottky
diode contributes. It was demonstrated that the simulation results followed the average
behavior of the physical system to a good approximation and yielded a current-to-voltage
characteristic equivalent to that of a memristor device electrically contacted by two
asymmetric Schottky barriers, validating the hypothesis that the experimental setup deals
with memristive behavior.
A Schottky diode can also be described with an equivalent circuit model consisting of a
nonlinear capacitor in parallel with a nonlinear resistor, according to the literature
[29]. The capacitor stands for the space charge capacitance and reflects only the free
carriers of the material, while the resistor represents the residual conductance of the
diode. In lightly doped materials, the free-carrier concentration can become comparable to
the deep level concentration, and in this case the charging and recharging of deep levels
also contribute considerably to the measured capacitance. In a Schottky barrier, the
barrier is high enough that there is a depletion region in the semiconductor near the
interface. In this depletion region, dopants remain ionized and give rise to the “space
charge”, which, in turn, gives rise to a capacitance of the junction. The
metal–semiconductor interface and the opposite boundary of the depleted area act like two
capacitor plates, with the depletion region acting as a dielectric. The amount of junction
capacitance depends, in the first place, on the applied terminal voltages. By applying a
voltage to the junction, the width of the
[Fig. 5.9 panels: (a) circuit schematic with source Vin, memristor M, resistors Rc, and capacitors C; plot of Log|I| [A], experiment]
Fig. 5.9 Equivalent circuit for memristive biosensors consisting of a memristor and nonlinear
sub-circuits (RC). Each sub-circuit (RC) consists of a nonlinear capacitor in parallel with a
nonlinear resistor (a). Semi-logarithmic current-to-voltage simulation results (red curve) obtained
from the equivalent circuit, compared with experimental results from electrical measurements of a
memristive biosensor, namely, the nanofabricated device after bio-functionalization with
antibodies, for three voltage sweeps (green curves) (b). The simulation current is scaled to the
experimental current range. The input voltage range is [−3:3] V and the resistance value is
0.85 kΩ (elaborated by [24])
space charge layer is shifted and the space charge within the depletion region varies,
since additional defect centers are ionized; as a result, the capacitance also changes.
Furthermore, the charging and recharging of the trap levels during a measurement cycle
periodically change the Schottky-barrier height, and the modified measurement current
finally gives a capacitive contribution to the diode admittance. Thus, both effects, the
variation of the bias and the consequent ionization of traps, cause a change in the
junction capacitance [30]. An equivalent circuit containing nonlinear sub-circuits (RC),
each consisting of a nonlinear capacitor in parallel with a nonlinear resistor, was then
introduced (Fig. 5.9a) in order to correctly model the appearance of the voltage gap in the
case of memristive biosensors.
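The bias dependence of the space charge capacitance discussed above can be sketched with
the textbook abrupt-junction (depletion approximation) expressions W = sqrt(2 * eps_s *
(V_bi - V) / (q * N_d)) and C/A = eps_s / W. These formulas are not taken from the chapter
and ignore the deep-level and trap contributions that the text emphasizes; the doping
density and built-in potential below are assumed values for illustration.

```python
import math

Q = 1.602176634e-19      # elementary charge, C
EPS_0 = 8.8541878128e-12 # vacuum permittivity, F/m
EPS_R_SI = 11.7          # relative permittivity of silicon


def depletion_width(v_applied, n_d=1e21, v_bi=0.5):
    """Depletion width in m for an applied bias in V (positive = forward).

    n_d (donor density, 1/m^3) and v_bi (built-in potential, V) are
    hypothetical values; 1e21 1/m^3 corresponds to 1e15 1/cm^3.
    """
    if v_applied >= v_bi:
        raise ValueError("depletion approximation requires v_applied < v_bi")
    return math.sqrt(2.0 * EPS_R_SI * EPS_0 * (v_bi - v_applied) / (Q * n_d))


def capacitance_per_area(v_applied, n_d=1e21, v_bi=0.5):
    """Parallel-plate picture: the depletion region acts as the dielectric."""
    return EPS_R_SI * EPS_0 / depletion_width(v_applied, n_d, v_bi)


for bias in (-1.5, 0.0, 0.3):
    w_um = depletion_width(bias) * 1e6
    c = capacitance_per_area(bias)
    print(f"V = {bias:+.1f} V: W = {w_um:.2f} um, C/A = {c:.2e} F/m^2")
```

Reverse bias widens the depletion region and lowers the capacitance, while forward bias
does the opposite, which is why the capacitor in the equivalent circuit is nonlinear.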
The sub-circuits were connected in series with the memristor (M) of the initial equivalent
circuit. It was demonstrated (Fig. 5.9b) that the two current minima are clearly separated
and that a voltage gap appears in the semi-logarithmic current-to-voltage characteristics
due to the presence of the capacitors now introduced. The fit of the simulation results to
the experimental data confirms that the voltage gap appearing in the experimental
current-to-voltage characteristics of the memristive biosensors was successfully reproduced
computationally, in very good agreement with the experimental outcomes. Measurement of the
junction capacitance is a very useful technique that gives information on Schottky-barrier
heights, dopant profiles, and the presence of traps and defects inside the semiconductor
and at the interface [31]. Several works reporting such measurements [29–32] give values
for the space charge capacitance of the junction that range from pF [29, 30] up to nF
[31, 32]. According to literature, the excess capacitance
The presence of antigens on the device surface seems to have the opposite effect compared
with that of the antibodies. Antigens are considered to mask the antibodies all around the
device and to decrease the positive charge effect that the antibodies introduce after the
bio-functionalization process. Thus, the uptake of antigens decreases the value of the
positive all-around gate bias voltage created by the presence of antibodies. According to
the previous arguments, the presence of antigens also affects
Table 5.1 Voltage gap values obtained experimentally for different antigen concentrations and
computationally estimated voltage gap values for different values of capacitance, selected according
to the experimental data. For 0 fM concentration of antigens, it is considered that the voltage gap
that appears is created only due to the bio-functionalization with antibodies
Antigen concentration (fM) | Voltage gap (V), experimental [2] | Capacitance (nF) | Voltage gap (V), simulation
0                          | 0.84                              | 36               | 0.844
5                          | 0.56                              | 24               | 0.563
10                         | 0.37                              | 15               | 0.362
the width of the voltage gap, which is already created by the presence of antibodies all
around the device after bio-functionalization. Collectively, the experimental data show a
contraction of the hysteresis window with increasing antigen concentration. To further
define the role of the capacitance value in the voltage gap, the experimentally obtained
voltage gap values for different antigen concentrations [2] were taken into consideration,
and different capacitance values were introduced into the aforementioned equivalent circuit
(Fig. 5.9a) designed for simulating the modified memristive behavior, in order to reproduce
computationally the voltage gap values obtained experimentally, as reported in Table 5.1.
Furthermore, the calibration curve (Fig. 5.10) shows the computationally estimated voltage
gap values that match the voltage gap values obtained experimentally [2] for different
antigen concentrations. The computationally obtained voltage gap values result from the
different equivalent capacitance values introduced into the equivalent circuit, and these
capacitance values are found to correspond to the values reported in literature for the
excess capacitance [32]. Intermediate theoretical values for the voltage gap, obtained from
simulations for different capacitance values, are also shown in Fig. 5.10. It can be
noticed that, to reproduce narrower experimentally obtained voltage gaps computationally,
lower capacitance values must be introduced into the equivalent circuit. This evidence
suggests that
[Fig. 5.10 plot axes: Antigen Concentration; Capacitance]
Fig. 5.10 Calibration curve obtained experimentally for three concentrations. The uptake of
antigens modifies the memristive behavior such that -|Vgap| increases with increasing antigen
concentration. For 0 fM concentration of antigens, it is considered that the voltage gap that
appears is created only by the bio-functionalization with antibodies. The theoretical values for
the voltage gap are obtained from simulations for different capacitance values (elaborated by [24])
increasing the concentration of antigens demands lower values of the capacitance introduced
into the equivalent circuit in order to achieve the same value of the voltage gap, within
this range of capacitance values. For zero concentration of antigens (0 fM), only the
voltage gap already created by the presence of antibodies after the bio-functionalization
process is considered.
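The numbers of Table 5.1 can be examined with a few lines of Python. The sketch below fits
a straight line to the simulated voltage gap as a function of the capacitance and compares
the simulated with the experimental gaps. The linear relation is only an illustrative
assumption for these three points and is not a model proposed in the source; the helper
`linear_fit` is likewise hypothetical.

```python
# Values copied from Table 5.1
capacitance_nf = [36.0, 24.0, 15.0]
gap_sim_v = [0.844, 0.563, 0.362]
gap_exp_v = [0.84, 0.56, 0.37]


def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(capacitance_nf, gap_sim_v)

# Largest deviation between simulated and experimental voltage gaps
max_dev = max(abs(s - e) for s, e in zip(gap_sim_v, gap_exp_v))

print(f"slope = {slope:.4f} V/nF, intercept = {intercept:.3f} V")
print(f"max |simulation - experiment| = {max_dev:.3f} V")
```

Within this narrow range the simulated gap grows roughly in proportion to the capacitance,
consistent with the statement that narrower gaps require lower capacitance values.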
[Fig. 5.11 plots: Log|Ids| [A] versus Vds [V]; Vgap [V] (0.04 to 0.16) versus PSA [M] (10^-17 to 10^-10), DNA aptamers]
Fig. 5.11 Representative electrical characteristics and PSA dose response of memristive aptasensor:
Indicative electrical characteristics demonstrating the introduction of the voltage gap occurring upon
bio-modification of the surface of the nanodevice (a). Calibration curve related to the average voltage
gap versus dose response (b) (Reproduced with permission from [3]. Copyright 2016 American
Chemical Society)
of pM. This outcome signifies that the detection operates within, and actually slightly
below, the clinical range (critical level of PSA 4 ng/mL, ca. 133 pM). This allows working
with highly diluted samples, so that only very low volumes of clinical samples are required
from the patient and detection at early stages can be achieved. An ultralow LOD of 23 aM
was achieved thanks to the implementation of the memristive aptasensors. This LOD was the
best so far reported in literature among electrochemical biosensors for PSA (Table 5.2).
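The conversion behind the quoted clinical threshold can be checked with a short
calculation. The molar mass of PSA is not stated in the text; the sketch assumes about
30 kDa, which reproduces the quoted value of ca. 133 pM. The function name is
hypothetical.

```python
def mass_to_molar(conc_ng_per_ml, molar_mass_g_per_mol):
    """Convert a mass concentration in ng/mL to a molar concentration in mol/L."""
    grams_per_litre = conc_ng_per_ml * 1e-9 * 1e3   # ng/mL -> g/L
    return grams_per_litre / molar_mass_g_per_mol


PSA_MOLAR_MASS = 30_000.0   # g/mol, assumed value (about 30 kDa)
LOD_MOLAR = 23e-18          # 23 aM, as quoted in the text

cutoff_molar = mass_to_molar(4.0, PSA_MOLAR_MASS)
cutoff_pm = cutoff_molar * 1e12
ratio = cutoff_molar / LOD_MOLAR

print(f"4 ng/mL PSA = {cutoff_pm:.0f} pM")
print(f"clinical threshold / LOD = {ratio:.1e}")
```

Under this assumption the limit of detection lies more than six orders of magnitude below
the clinical threshold, which is what makes strongly diluted samples usable.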
Furthermore, the nanofabricated structures are exposed to PSA prepared in non-diluted, full
human serum at concentrations below the clinical range. This offers a proof of the
capability of the sensor to function at extremely low concentrations of biomarkers and
confirms the increasing trend that results from the increasing negative charge introduced
on the surface of the nanodevice.
Having demonstrated the direct and highly efficient response of the nanobiosensor
prototype, which accurately follows the various steps of the DNA aptamer
binding-regeneration cycle, the memristive properties of the nanosensors are further
leveraged for the label-free, ultrasensitive detection of therapeutic compounds (drugs),
bringing a completely new perspective for label-free monitoring in personalized and
precision medicine. Ultrasensitive drug screening is a key aspect in the field of
therapeutics. As therapeutic compounds are going to be supplied in lower and lower
concentrations, the need for more sensitive detectors is of immense importance. Therefore,
the memristive aptasensors that gave ultrasensitive sensing outputs with cancer biomarkers
were implemented for effective ultrasensitive drug screening as well. The implementation of
DNA aptamers also offers the potential for regeneration of the nanosensor, opening the way
for continuous monitoring of therapeutic compounds, a very significant requirement in
therapeutics. To better show the performance of the proposed new biosensors, Tenofovir
(TFV), an antiviral drug for HIV treatment, is considered here as a model drug. The
therapeutic concentration range of TFV in the circulatory plasma extends from a few nM up
to 860 nM. TFV-aptamers (5’-Aptamer-C6 Amino-3’) developed for specific interaction with
TFV were immobilized on the surface of the memristive devices, and the detection was
performed for drug concentrations within and slightly below the clinical range, opening the
possibility of future applications that require only minimal amounts of clinical samples
(Fig. 5.12).
[Figure 5.12 schematic labels: Ids, Vds, NiSi contact pads, SiO2, Si]
Fig. 5.12 Schematic representation illustrating the memristive sensor, and SEM micrograph depict-
ing the Si-NW arrays anchored between the NiSi pads, which serve as electrical contacts of the
freestanding memristive nanodevice. The position of the current minima for the forward and the
backward regimes changes after the surface treatment introducing a voltage difference in the semi-
logarithmic current-to-voltage characteristics (Reference [33]-Reproduced by permission of The
Royal Society of Chemistry)
[Figure 5.13 panels: (a) Blank and Concentration [pM]; (b) Blank and Concentration [nM]]
Fig. 5.13 Analytical performance and effective drug detection through the electrical hysteresis
variations in buffer (a) and in full human serum (b). For the in serum detection, a new drug detection
is performed following regeneration of the memristive aptasensor. The response of the sensor to
the new drug binding fits ideally the calibration curve obtained initially. The exposure of the sensor
directly after, to a nonspecific drug, does not result in any signal difference (Reference [33]-
Reproduced by permission of The Royal Society of Chemistry)
The regeneration properties of the DNA aptamers were directly reflected in the response of
the memristive aptasensors, expressed through a voltage difference appearing in the
semi-logarithmic current-to-voltage characteristics. The whole sensing cycle, consisting
of aptamer immobilization, target-drug binding, aptamer regeneration, and target-drug
rebinding, was for the first time portrayed through the variations of the electrical
hysteresis of the memristive nanostructures. A series of denaturation, pH shocking, and
refolding steps is used for surface regeneration (unbinding the target from the aptamer
and restoring the aptamers to their previous folded and functional form). After treatment
with DNA aptamers, the electrical response of the nanodevices indicated a voltage
difference of 116 ± 34 mV. Following exposure of the nanodevices to a 100 nM solution of
the negatively charged target drug, the voltage difference increases as a consequence of
drug binding, reaching 156 ± 24 mV (an increase of 34.5%). However, the most promising
outcome lies in the electrical response after the aptamer regeneration that follows drug
binding. The voltage difference decreased after the regeneration process back to the blank
level (120 ± 15 mV), namely, the level obtained initially just after aptamer
immobilization on the surface. In order to verify the reliability of the nanobiosensor in
repeated measures and the capability for further capturing and detecting the target
molecules after
[Figure 5.14 plot: y-axis Voltage Gap [V], 0.0 to 0.3; step labels include "3. Drug Binding" and "5. Drug Binding"]
Fig. 5.14 DNA aptamer immobilization, target molecule binding, and DNA aptamer regenera-
tion cycle, illustrated through the electrical hysteresis variations (Reference [33]-Reproduced by
permission of The Royal Society of Chemistry)
5.6 Conclusions
As we have seen in this chapter, coupling the memristive effect with biological processes
yields innovative nanobiosensing technologies with unprecedented performance in both
diagnostics and therapeutics. Silicon nanowire arrays exhibiting a memristive effect,
combined with sophisticated bio-functionalization, give rise to a new kind of biological
sensor: the memristive biosensor. This new class of biosensors addresses the problem of
early cancer detection, since it provides high-performance, ultrasensitive, label-free
electrochemical sensing of extremely small traces of cancer biomarkers such as PSA, as
well as effective screening and continuous monitoring of therapeutic compounds such as
TFV, including in full, undiluted human serum. This capability opens the way to new
solutions in medical practice, especially in the field of personalized and precision
medicine.
References
14. M. Di Ventra, Y.V. Pershin, L.O. Chua, Circuit elements with memory: memristors, memca-
pacitors, and meminductors. Proc. IEEE 97(10), 1717–1724 (2009)
15. A. Gelencsér, T. Prodromakis, C. Toumazou, T. Roska, Biomimetic model of the outer plexiform
layer by incorporating memristive devices. Phys. Rev. E 85(4), 041918 (2012)
16. J.J. Yang, M.D. Pickett, X. Li, D.A. Ohlberg, D.R. Stewart, R.S. Williams, Memristive switching
mechanism for metal/oxide/metal nanodevices. Nat. Nanotechnol. 3(7), 429–433 (2008)
17. S. Shin, K. Kim, S.M. Kang, Compact models for memristors based on charge-flux constitutive
relationships. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 29(4), 590–598 (2010)
18. D. Biolek, M. Di Ventra, Y.V. Pershin, Reliable spice simulations of memristors, memcapacitors
and meminductors. Radioengineering 22 (2013)
19. I. Vourkas, A. Batsos, G.C. Sirakoulis, Spice modeling of nonlinear memristive behavior. Int.
J. Circuit Theory Appl. 43(5), 553–565 (2015)
20. F. Puppo, F.L. Traversa, M. Di Ventra, G. De Micheli, S. Carrara, Surface trap mediated electronic
transport in biofunctionalized silicon nanowires. Nanotechnology 27(34), 345503 (2016)
21. Z. Biolek, D. Biolek, V. Biolkova, Spice model of memristor with nonlinear dopant drift.
Radioengineering 18(2), 210–214 (2009)
22. S. Benderli, T.A. Wey, On spice macromodelling of tio2 memristors. Electron. Lett. 45(7),
377–379 (2009)
23. A. Rak, C. Gyorgy, Macromodeling of the memristor in spice. Trans. Comp. Aided Des. Integ.
Circuit Sys. 29, 632–636 (2010)
24. I. Tzouvadaki, F. Puppo, M. Doucey, G.D. Micheli, S. Carrara, Computational study on the
electrical behavior of silicon nanowire memristive biosensors. IEEE Sens. J. 15(11), 6208–6217
(2015)
25. S.H. Lee, Y.S. Yu, S.W. Hwang, D. Ahn, A spice-compatible new silicon nanowire field-effect
transistors (snwfets) model. IEEE Trans. Nanotechnol. 8, 643–649 (2009)
26. H. Elhadidy, J. Sikula, J. Franc, Symmetrical current-voltage characteristic of a metal-
semiconductor-metal structure of schottky contacts and parameter retrieval of a cdte structure.
Semicond. Sci. Technol. 27(1), 015006 (2012)
27. S. Lee et al., Equivalent circuit model of semiconductor nanowire diode by spice. J. Nanosci.
Nanotechnol. (2007)
28. C.Y. Yim, et al., Electrical properties of the zno nanowire transistor and its analysis with
equivalent circuit model. J. Korean Phys. Soc. 48, 1565–1569 (2006)
29. K. Steiner, Capacitance-voltage measurements on schottky diodes with poor ohmic contacts.
IEEE Trans. Instrum. Meas. 42(1), 39–43 (1993)
30. M. Bleicher, E. Lange, Schottky-barrier capacitance measurements for deep level impurity
determination. Solid State Electron. 16(3), 375–380 (1973)
31. P.S. Ho, E.S. Yang, H.L. Evans, X. Wu, Electronic states at silicide-silicon interfaces. Phys.
Rev. Lett. 56(1), 177–180 (1986)
32. J. Werner, A.F.J. Levi, R.T. Tung, M. Anzlowar, M. Pinto, Origin of the excess capacitance at
intimate schottky contacts. Phys. Rev. Lett. 60(1), 53–56 (1988)
33. I. Tzouvadaki, N. Aliakbarinodehi, G. de Micheli, S. Carrara, The memristive effect as a novelty
in drug monitoring. Nanoscale 9(27), 9676–9684 (2017)
34. S. Shigdar, J. Lin, Y. Yu, M. Pastuovic, M. Wei, W. Duan, Rna aptamer against a cancer stem
cell marker epithelial cell adhesion molecule. Cancer Sci. 102(5), 991–998 (2011)
35. D.H.J. Bunka, P.G. Stockley, Aptamers come of age -at last. Nat. Rev. Microbiol. 4(8), 588–596
(2006)
36. E. Levy-Nissenbaum, A.F. Radovic-Moreno, A.Z. Wang, R. Langer, O.C. Farokhzad, Nan-
otechnology and aptamers: applications in drug delivery. Trends Biotechnol. 26(8), 442–449
(2008)
37. M. Souada, B. Piro, S. Reisberg, G. Anquetin, V. Noël, M. Pham, Label-free electrochemical
detection of prostate-specific antigen based on nucleic acid aptamer. Biosens. Bioelectron. 68,
49–54 (2015)
38. P. Jolly, N. Formisano, J. Tkáč, P. Kasák, C.G. Frost, P. Estrela, Label-free impedimetric
aptasensor with antifouling surface chemistry: a prostate specific antigen case study. Sens.
Actuators B Chem. 209, 306–312 (2015)
39. B. Liu, L. Lu, E. Hua, S. Jiang, G. Xie, Detection of the human prostate-specific antigen
using an aptasensor with gold nanoparticles encapsulated by graphitized mesoporous carbon.
Microchim. Acta 178(1), 163–170 (2012)
40. B. Kavosi, A. Salimi, R. Hallaj, F. Moradi, Ultrasensitive electrochemical immunosensor for
psa biomarker detection in prostate cancer cells using gold nanoparticles/pamam dendrimer
loaded with enzyme linked aptamer as integrated triple signal amplification strategy. Biosens.
Bioelectron. 74, 915–923 (2015)
41. Z. Yang, B. Kasprzyk-Hordern, S. Goggins, C.G. Frost, P. Estrela, A novel immobilization
strategy for electrochemical detection of cancer biomarkers: DNA-directed immobilization of
aptamer sensors for sensitive detection of prostate specific antigens. Analyst 140(8), 2628–2633
(2015)
42. P. Jolly, V. Tamboli, R.L. Harniman, P. Estrela, C.J. Allender, J.L. Bowen, Aptamer-mip hybrid
receptor for highly sensitive electrochemical detection of prostate specific antigen. Biosens.
Bioelectron. 75, 188–195 (2016)
43. K. Radhapyari, P. Kotoky, M.R. Das, R. Khan, Graphene-polyaniline nanocomposite based
biosensor for detection of antimalarial drug artesunate in pharmaceutical formulation and bio-
logical fluids. Talanta 111, 47–53 (2013)
44. B. Bo, X. Zhu, P. Miao, D. Pei, B. Jiang, Y. Lou, Y. Shu, G. Li, An electrochemical biosensor
for clenbuterol detection and pharmacokinetics investigation. Talanta 113(9), 36–40 (2013)
45. D.-M. Kim, M.A. Rahman, M.H. Do, C. Ban, Y.-B. Shim, An amperometric chloramphenicol
immunosensor based on cadmium sulfide nanoparticles modified-dendrimer bonded conduct-
ing polymer. Biosens. Bioelectron. 25(3), 1781–1788 (2010)
46. W.U. Wang, C. Chen, K.-H. Lin, Y. Fang, C.M. Lieber, Label-free detection of small-molecule-
protein interactions by using nanowire nanosensors. Proc. Natl. Acad. Sci. USA 102(3), 3208–
3212 (2005)
47. H. Karimi-Maleh, F. Tahernejad-Javazmi, N. Atar, M.L. Yola, V.K. Gupta, A.A. Ensafi, A novel
DNA biosensor based on a pencil graphite electrode modified with polypyrrole/functionalized
multiwalled carbon nanotubes for determination of 6-mercaptopurine anticancer drug. Ind.
Eng. Chem. Res. 54(4), 3634–3639 (2015)
48. R.N. Goyal, V.K. Gupta, S. Chatterjee, Voltammetric biosensors for the determination of parac-
etamol at carbon nanotube modified pyrolytic graphite electrode. Sens. Actuators B Chem.
149(8), 252–258 (2010)
49. Ľ. Švorc, J. Sochr, P. Tomčík, M. Rievaj, D. Bustin, Simultaneous determination of paraceta-
mol and penicillin V by square-wave voltammetry at a bare boron-doped diamond electrode.
Electrochim. Acta 68(4), 227–234 (2012)
50. K. Radhapyari, P. Kotoky, R. Khan, Detection of anticancer drug tamoxifen using biosensor
based on polyaniline probe modified with horseradish peroxidase. Mater. Sci. Eng. C Mater.
Biol. Appl. 33, 583–587 (2013)
51. M.N. Kammer, I.R. Olmsted, A.K. Kussrow, M.J. Morris, G.W. Jackson, D.J. Bornhop, Charac-
terizing aptamer small molecule interactions with backscattering interferometry. Analyst 139,
5879–5884 (2014)
52. M. Simiele, C. Carcieri, A. De Nicolò, A. Ariaudo, M. Sciandra, A. Calcagno, S. Bonora, G. Di
Perri, A. D’Avolio, A LC-MS method to quantify tenofovir urinary concentrations in treated
patients. J. Pharm. Biomed. Anal. 114(10), 8–11 (2015)
53. R. Jain, R. Sharma, Cathodic adsorptive stripping voltammetric detection and quantification
of the antiretroviral drug tenofovir in human plasma and a tablet formulation. J. Electrochem.
Soc. 160(8), H489–H493 (2013)
54. M.E. Barkil, M.-C. Gagnieu, J. Guitton, Relevance of a combined uv and single mass spec-
trometry detection for the determination of tenofovir in human plasma by hplc in therapeutic
drug monitoring. J. Chromatogr. B 854(7), 192–197 (2007)
Chapter 6
Optimized Programming
for STT-MTJ-Based TCAM
for Low-Energy Approximate Computing
Abstract With the advent of data-driven systems and processes, high-speed and
energy-efficient computing techniques are highly desirable. Such systems and techniques
are already employed in many applications that depend on huge amounts of data, such as
information analysis, transmission, policy, and decision-making. An electronic system used
in these applications is required to perform operations such as data capture, storage,
visualization, and analysis. Most such systems employ content addressable memories (CAMs),
also known as associative memories, for high-speed data search/compare and compute
operations. In this chapter, an optimized programming scheme for magnetic tunnel junction
(MTJ) based resistive ternary content addressable memory (ReTCAM) for approximate
computing (AC) applications is presented. Key concepts related to MTJ structure, physics,
electrical behavior, bit-cell design, and AC are also discussed. The error-tolerant
behavior of AC and the stochastic writing of the ReTCAM cell are exploited to achieve low
write energy. A case study of a 3-bit (LSB) write operation using the proposed programming
scheme is also investigated in terms of distance match accuracy. The ReTCAM bit-cell is
designed using a perpendicular magnetic anisotropy (PMA) MTJ device with 32 nm diameter
and 90 nm CMOS technology.
6.1 Introduction
Data and network-centric applications involve operations like pattern matching, clas-
sification, similarity search, and online learning [1–3]. Such computations require
massive parallel data processing (associative computing) to achieve high-speed and
efficient computational capability [4].
There are several existing designs and architectures, including both pure CMOS-based and
emerging nonvolatile memory (NVM) based TCAMs. A few primary design parameters are
important for TCAM hardware implementation [7]:
1. Energy: Due to massively parallel operation, TCAMs are bound to consume large
energy. Hence, energy-efficient designs are highly desirable.
2. Area: TCAM is used in constrained environments for very specific purposes. The
need for a parallel architecture leads to a large footprint that mainly depends on
the TCAM cell area. An area-efficient architecture is therefore essential.
3. Speed: The main reason to employ a TCAM is its speed of operation. A fast
TCAM is always preferable.
4. Noise Margin: A TCAM cell has to store three different logic values, unlike a CAM
cell, which stores two. A sufficient, optimized noise margin among all the states
of the TCAM is required to avoid logic errors during read/search operations.
162 A. Kumar and M. Suri
Fig. 6.2 Different implementations of ReTCAM bit cells using MTJ devices: a 9T-2MTJ cell having
nine transistors and two MTJ devices; used to design a dual rail TCAM for soft error tolerance [22],
b 6T-2MTJ cell is used to design segmented match line (ML) to reduce the dynamic search power
[23], and c 4T-2MTJ cell with comparatively lower footprint with high-speed operation [24]
The MTJ device stack consists of several material layers. However, three layers (one oxide
layer and two magnetic layers) play the most important role. A thin oxide layer is inserted
between the two magnetic layers. One of the two magnetic layers is called the free
layer and the other is called the fixed/pinned layer. Figure 6.3 illustrates the MTJ
device stack. The functionality of the spin transfer torque (STT) based perpendicular
magnetization MTJ (PMTJ) is enabled mainly by two phenomena:
1. STT effect for writing.
2. The tunnel magnetoresistance (TMR) effect for reading.
The magnetic orientation of the free layer can be switched reversibly between
parallel (P) and antiparallel (AP) with respect to the magnetic orientation of the fixed
layer by using the STT mechanism [25]. The magnetization of the free layer is switched
by passing a current of appropriate amplitude and polarity through the stack (Fig. 6.3).
Depending on the P or AP magnetization of the free layer with reference to the fixed
layer, the MTJ structure exhibits two different electrical resistances, referred to as
RP (low resistance state) and RAP (high resistance state), respectively, as shown
in Figs. 6.3 and 6.4a. The sensing/reading performance of an MTJ depends on the TMR
ratio. A high TMR ratio helps in accurate detection of the MTJ resistance states in both
memory and logic circuits. The TMR ratio is defined by (6.1) [13].
$$\text{TMR Ratio} = \frac{R_{AP} - R_{P}}{R_{P}} \tag{6.1}$$
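As a small numerical illustration of (6.1), the sketch below computes the TMR ratio from two resistances and, conversely, the antiparallel resistance implied by a given TMR. The 120% TMR is the value the chapter quotes from Table 6.1; the 5 kΩ parallel resistance and the function names are hypothetical placeholders.

```python
def tmr_ratio(r_ap, r_p):
    """TMR ratio as defined in (6.1): (R_AP - R_P) / R_P."""
    return (r_ap - r_p) / r_p


def r_ap_from_tmr(r_p, tmr):
    """Invert (6.1): R_AP = R_P * (1 + TMR)."""
    return r_p * (1.0 + tmr)


R_P_OHMS = 5_000.0  # HYPOTHETICAL parallel-state resistance, for illustration only
TMR = 1.20          # 120%, the TMR ratio quoted from Table 6.1

r_ap = r_ap_from_tmr(R_P_OHMS, TMR)
resistance_window = r_ap - R_P_OHMS  # separation between the P and AP states

print(f"R_AP = {r_ap:.0f} ohm, window = {resistance_window:.0f} ohm, "
      f"TMR = {tmr_ratio(r_ap, R_P_OHMS):.2f}")
```

With a TMR of 120%, the high resistance state is 2.2 times the low resistance state, whatever the absolute value of RP.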
The TMR ratio directly determines the resistance window between the P and AP states, as
shown in Fig. 6.4a. MTJ devices show an inherent switching probability that depends on the
applied switching conditions, also referred to as switching stochasticity [14]. One such
case is illustrated in Fig. 6.4b, where the MTJ switches during the first cycle but does
not switch in the second cycle even though the same voltage pulse (Vpulse) is applied.
This switching stochasticity is referred to as intrinsic switching variability, and it
affects the cycle-to-cycle switching of the MTJ device.
For the analysis presented in this chapter, a Verilog-A compact model [14] of the
STT-PMTJ is used for all circuit simulations. This model includes both the intrinsic and
the extrinsic variability of the MTJ device. Intrinsic variability affects cycle-to-cycle
switching, whereas extrinsic variability affects device-to-device switching of
the MTJ devices. Switching conditions and statistics are studied numerically and
condensed into an analytical model, as described in [14]. The switching distribution is
described by a gamma distribution, whose mean and skewness are obtained as a
function of the ratio of the programming voltage to the critical voltage (Vc), as
shown in Fig. 6.4c.
For voltages significantly higher than the critical voltage Vc, the switching
probability distribution versus pulse duration resembles a Gaussian distribution, with an
average switching time inversely proportional to Vpulse. As the voltage is decreased, the
average switching time tends to increase exponentially. Furthermore, the skewness also
increases, making the switching probability distribution function more asymmetric. The
inset of Fig. 6.4c illustrates the switching distribution of the MTJ at a switching
voltage of Vpulse = 2Vc. Table 6.1 lists some of the physical parameters of the MTJ device
that are used for the circuit-level simulations. The TMR ratio of 120% at room temperature
(presented in Table 6.1) has been reported for the same PMTJ stack with CoFeB/MgO/CoFeB
layers [26].
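The role of the gamma-distributed switching time can be illustrated with a minimal sketch that estimates the switching probability for a given pulse width by sampling. This is not the Verilog-A compact model of [14]; the mean switching time, the shape parameter, and the function names are hypothetical, and only the standard property that a gamma distribution with shape k has skewness 2/sqrt(k) is relied upon.

```python
import random


def switching_probability(pulse_width_ns, mean_ns, shape, n_cycles=1000, seed=0):
    """Fraction of cycles in which a gamma-distributed switching time
    is no longer than the applied pulse width."""
    rng = random.Random(seed)
    scale = mean_ns / shape  # gamma mean = shape * scale
    switched = sum(
        rng.gammavariate(shape, scale) <= pulse_width_ns
        for _ in range(n_cycles)
    )
    return switched / n_cycles


def gamma_skewness(shape):
    """Skewness of a gamma distribution; it shrinks as the shape grows,
    so a large shape gives a nearly symmetric, Gaussian-like curve."""
    return 2.0 / shape ** 0.5


# HYPOTHETICAL parameters: 10 ns mean switching time, shape 16 (skewness 0.5).
for width in (5.0, 10.0, 15.0, 30.0):
    p = switching_probability(width, mean_ns=10.0, shape=16.0)
    print(f"pulse {width:4.1f} ns -> switching probability {p:.3f}")
```

In this picture, lowering the programming voltage would correspond to a larger mean and a smaller shape (higher skewness), so a longer pulse is needed to reach the same switching probability.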
MTJ devices exhibit device-to-device switching variation due to process variation
(differences in device geometry and other physical parameters) [13, 27]. Hence,
the programming pulse required to obtain a particular switching probability (say 50%)
differs from one MTJ device to another. The model helps to predict the mean switching
time of the devices under intrinsic and extrinsic variations. Monte Carlo
simulation-based analysis of the MTJ’s intrinsic and extrinsic variations captures the
complete behavior of MTJ devices [28]. A switching time distribution for switching the
MTJ device from the AP → P state over 1000 cycles is shown in Fig. 6.5.
The mean switching time of a single MTJ is extracted from this distribution.
Furthermore, mean switching times for 1000 MTJ devices (considering spatial
device-to-device variability), obtained using Monte Carlo simulations [28], are shown in
Fig. 6.6.
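The two levels of variability can be sketched as a nested Monte Carlo loop: an outer draw per device (extrinsic) and an inner draw per write cycle (intrinsic). This is only an illustration of the procedure, not the simulation setup of [28]; the lognormal device spread, the gamma shape, and all numerical values are assumptions.

```python
import random
import statistics


def simulate_mean_switching_times(n_devices, n_cycles, nominal_mean_ns=10.0,
                                  device_sigma=0.05, shape=16.0, seed=1):
    """Return one mean switching time per simulated device."""
    rng = random.Random(seed)
    device_means = []
    for _ in range(n_devices):
        # Extrinsic variability (ASSUMED lognormal): each device has its own
        # average switching time around the nominal value.
        device_mean = nominal_mean_ns * rng.lognormvariate(0.0, device_sigma)
        scale = device_mean / shape
        # Intrinsic variability: cycle-to-cycle switching times of this device.
        times = [rng.gammavariate(shape, scale) for _ in range(n_cycles)]
        device_means.append(statistics.fmean(times))
    return device_means


# Smaller counts than the 1000 devices x 1000 cycles in the text, to keep it fast.
means = simulate_mean_switching_times(n_devices=100, n_cycles=200)
print(f"mean of device means: {statistics.fmean(means):.2f} ns, "
      f"device-to-device spread: {statistics.stdev(means):.2f} ns")
```

The histogram of `means` plays the role of Fig. 6.6, while the `times` list of a single device corresponds to the per-device distribution of Fig. 6.5.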
This section explains the ReTCAM cell schematic and its operation, and summarizes the
effect of MTJ variability on the TCAM cell operation.
The ReTCAM cell is designed to store three different logic states, i.e., “0”, “1”,
and “X” (don’t care). The area-efficient nonvolatile hybrid 4T-2MTJ ReTCAM cell
structure used in this chapter is inspired by [24] and is shown in Fig. 6.7.
Searching and writing are the two basic operations of the ReTCAM cell. Table 6.2
lists all logic states of a ReTCAM cell and the corresponding resistance states of
the MTJ devices. The 4T-2MTJ ReTCAM cell structure has two MTJ devices (MTJ1 and
MTJ2), two nMOS comparison transistors (NM1 and NM2), a write control pMOS
transistor (PM1), and a match line-driver nMOS transistor (NM3, connected in diode
configuration). The gate and drain of NM3 are connected to the ML. Data signals as well
as control signals for both search and write operations are multiplexed on the same
lines so that only one set of signals is required at a time (Fig. 6.7).
Table 6.3 ReTCAM cell truth table for search and write operations
Operation          ML    BL    BLb   PM1  NM3  NM1  NM2
Search “0”         High  –     –     OFF  ON   OFF  ON
Search “1”         High  –     –     OFF  ON   ON   OFF
Write “0” Step-1   Low   Low   High  ON   OFF  ON   OFF
Write “0” Step-2   Low   High  Low   ON   OFF  OFF  ON
Write “1” Step-1   Low   High  Low   ON   OFF  ON   OFF
Write “1” Step-2   Low   Low   High  ON   OFF  OFF  ON
Write “X”          Low   High  Low   ON   OFF  ON   ON
Search Operation: During a search operation, every ReTCAM cell compares an
input bit on its SL and SLb lines with the bit prestored in the cell, as
illustrated in Table 6.2. The ML is always precharged to a fixed potential before a
search operation begins. The transistor states for searching “0” and “1” in the ReTCAM
cell are given in Table 6.3. For search “1”, SL and SLb are at high and low voltages,
respectively, and vice versa for search “0”. Either NM1 or NM2 creates a discharge path
for the ML, and the resistances of MTJ1 and MTJ2 then determine the amount of discharge
within a fixed time frame.
The ML potential is then compared with a threshold value. If the ML potential is higher
than the threshold, the result is a match; otherwise, the result is a mismatch. The ML
discharge current through the cell can be optimized by adjusting (1) the initial
precharge of the ML, (2) the size of NM3, and (3) the gate control voltages and sizes of
the access transistors (NM1 and NM2). The SL and SLb voltage levels can be tuned
according to the desired noise margin and operation speed. In this study, the ML is
precharged to 1.2 V, and we observed ∼10 µA and ∼15.5 µA as match and mismatch currents,
respectively, with a noise margin of ∼312 mV at a search latency of 140 ps. If a cell is
in the don’t care state “X”, both MTJ devices are in their high resistance state, and the
result of any search operation on the cell is always a match.
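At the functional level, the search behavior described above can be sketched as follows. This models only the logical outcome (match or mismatch), not the ML discharge or the MTJ resistances; the string encoding of entries and the function names are illustrative assumptions.

```python
def cell_matches(stored, searched):
    """Functional view of one ternary cell: a stored 'X' matches any search bit."""
    return stored == "X" or stored == searched


def word_matches(stored_word, search_word):
    """An entry matches only if every one of its cells matches."""
    if len(stored_word) != len(search_word):
        raise ValueError("stored word and search key must have equal length")
    return all(cell_matches(s, q) for s, q in zip(stored_word, search_word))


def search(entries, key):
    """Return the indices of all entries matching the key. In hardware all
    entries are compared in parallel; here they are scanned sequentially."""
    return [i for i, entry in enumerate(entries) if word_matches(entry, key)]


entries = ["1010", "10XX", "0110"]
print(search(entries, "1011"))  # only the entry with don't-care LSBs matches
```

The second entry illustrates why writing “X” into LSBs widens the set of keys an entry responds to, which is the property the approximate write scheme later relies on.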
Write Operation: The write signals are multiplexed with the search signals on
the same lines, as shown in Fig. 6.7. As discussed above, the search operation is a
single-step process. Write, however, is a two-step process for write “0” and write “1”,
whereas it is a single-step operation for write “X”, as presented in Table 6.3. During a
write operation, the WEN signal enables PM1 and the WL signals are applied to the access
transistors (NM1 and NM2). BL and BLb are always precharged to opposite voltage
potentials and hence decide the direction of the current flowing through the MTJ devices.
Depending on the direction of the current, the MTJ devices are switched from the AP → P
state or vice versa, as explained earlier in Sect. 6.2. WL1 and WL2 determine the order
in which the MTJ devices are accessed for programming in the ReTCAM cell. Write energy
and latency are found to be the same for bit-level cell writing from state “0” to “1”
and vice versa.
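The write sequences can be captured as a small lookup structure. The signal levels below are transcribed from the write rows of Table 6.3; the class name, field names, and helper function are hypothetical and serve only to make the one-step versus two-step distinction explicit.

```python
from typing import NamedTuple


class WriteStep(NamedTuple):
    """Line levels and transistor states for one write step (from Table 6.3)."""
    ml: str
    bl: str
    blb: str
    pm1: str
    nm3: str
    nm1: str
    nm2: str


WRITE_SEQUENCES = {
    "0": [WriteStep("Low", "Low", "High", "ON", "OFF", "ON", "OFF"),
          WriteStep("Low", "High", "Low", "ON", "OFF", "OFF", "ON")],
    "1": [WriteStep("Low", "High", "Low", "ON", "OFF", "ON", "OFF"),
          WriteStep("Low", "Low", "High", "ON", "OFF", "OFF", "ON")],
    # Write "X" enables both access transistors in a single step.
    "X": [WriteStep("Low", "High", "Low", "ON", "OFF", "ON", "ON")],
}


def write_steps(value):
    """Return the ordered list of write steps for logic value '0', '1' or 'X'."""
    return WRITE_SEQUENCES[value]


for value in ("0", "1", "X"):
    print(value, "->", len(write_steps(value)), "step(s)")
```

Note that BL and BLb are at opposite levels in every step, consistent with the statement that they set the direction of the write current.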
If only one MTJ has to be written to program the cell, then the write probability
of the cell is equivalent to the switching probability of a single MTJ. We carried out
simulations in which the ReTCAM cell was written from Logic “0” → “1”, Logic
“0” → “X”, Logic “1” → “X”, and the respective reverse cases, giving six
different writing combinations. These write simulations were performed for 1000
cycles to study the impact of MTJ stochastic switching on the ReTCAM cell. This
helps to derive a write condition for any targeted cell-level error tolerance (ET) while
optimizing the cell write energy and latency. As the targeted ET value for an application
increases, the cell energy consumption can be further reduced by operating the
ReTCAM cells in a more probabilistic manner.
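The relation between single-MTJ switching probability and cell-level ET can be sketched under two explicit assumptions that the text does not state: that the MTJ switching events within a cell are statistically independent, and that ET is interpreted as the allowed cell write-failure probability. The function names are hypothetical.

```python
def cell_write_probability(mtj_switch_probs):
    """Probability that every required MTJ switching event succeeds,
    ASSUMING the events are independent."""
    probability = 1.0
    for p in mtj_switch_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        probability *= p
    return probability


def meets_error_tolerance(mtj_switch_probs, error_tolerance):
    """True if the cell write-failure probability stays within the targeted ET."""
    return 1.0 - cell_write_probability(mtj_switch_probs) <= error_tolerance


def required_single_mtj_probability(error_tolerance, n_switches):
    """Per-MTJ switching probability needed when n_switches equal,
    independent switching events must all succeed."""
    return (1.0 - error_tolerance) ** (1.0 / n_switches)


# One MTJ to switch: cell write probability equals the MTJ switching probability.
print(cell_write_probability([0.97]))
# Two MTJs to switch at ET = 3%: each one must be slightly more reliable.
print(required_single_mtj_probability(0.03, 2))
```

A looser ET lowers the required per-MTJ switching probability, which in turn permits a shorter or lower-voltage pulse; that is the lever the probabilistic write scheme uses.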
Several applications (e.g., similarity search and pattern matching) have scope for
error/noise tolerance (ET) [11, 29]. For such computational paradigms, we propose
an approximate computing (AC), low-energy write scheme that exploits the ET.
In the proposed scheme, a few LSBs within each ReTCAM entry are written with a
probabilistic write (as illustrated in Fig. 6.10). Such probabilistic writing of the LSBs
lowers the overall energy consumption.
For a multi-bit hybrid ReTCAM, we present a case in which three LSBs are written
in a probabilistic manner. The number of LSBs with probabilistic programming may
change depending on the actual use case. The LSBs are written for different cell-level
targeted ET values (i.e., 0.1, 0.3, and 3%) using probabilistic write conditions, whereas
the rest of the bits are considered to be written with deterministic write conditions. The
targeted ET of 0.1% is taken as the minimum/reference ET for the comparative study
of write energy and latency. Results obtained using probabilistic “0” → “1” (or vice
versa) and “0” (or “1”) → “X” LSB writes are summarized below.
Fig. 6.11 ReTCAM cell average write energy per bit dependence on cell write probability, voltage
(BL/BLb) and latency. Dashed curves represent energy consumption, whereas, solid curves represent
probability
In this case, we write each LSB cell from Logic “0” to Logic “1” for 1000 cycles.
The impact of MTJ stochastic switching on the ReTCAM cell performance parameters
with write voltage scaling (WVS) (1.2 to 0.6 V) is illustrated in Fig. 6.11. The average
write energy per bit and latency for a targeted cell-level ET (i.e., 0.1%) are shown
in Table 6.5. Figure 6.11 and Table 6.5 show that, with WVS, the energy consumption
decreases at the cost of latency. To investigate further, we simulated the cell for
different values of ET (making the MTJ switching more probabilistic). We achieved
the best trade-off between write energy and latency at a cell write voltage of 1 V.
The voltage drop (i.e., Vpulse) across the MTJ devices in the cell was ≈ 2Vc at 1 V write
voltage. From Fig. 6.4c, a programming voltage of around 2Vc is the minimum
voltage at which the skewness curve is almost flat, and the switching distribution curve
(inset of Fig. 6.4c) is nearly symmetric at 2Vc. The performance improvement per LSB
for different ET values is illustrated in Table 6.6. Higher ET values (0.3 and 3%) lead to
a further improvement in energy and latency.
Write “X” is a single-step process because only one MTJ needs to be switched into the
high resistance state, irrespective of the Logic state (“0” or “1”) previously stored in
the cell. Thus, Write “X” is a relatively faster operation. Writing Logic “X” in the cell
also increases the number of matching sequences irrespective of the search bit because,
for such cells, the search result will always be a match. Thus, some LSBs of ReTCAM
entries can be written with “X” (using probabilistic write) without significantly
affecting the correctness of the search operation. The impact of probabilistic write in
the ReTCAM with voltage scaling from 1.2 to 0.6 V is shown in Fig. 6.12. The average write
energy per bit, write latency, and their improvement compared to Write “1” are shown in
Table 6.7 (at a fixed ET percentage). Writing “X” at 1 V showed better performance than
Write “1” for identical ET. Similarly, Table 6.8 illustrates the average write energy per
bit, latency, and their improvement factor compared to Write “1”.
Fig. 6.12 ReTCAM cell average write energy dependence on cell writing probability, write voltage
(BL/BLb), and write latency. Dashed curves represent energy, whereas, solid curves represent
probability
6.5 Discussion
Write voltage scaling (WVS) was found to reduce energy, but at the cost of increased
latency, irrespective of the Logic state (“0”, “1”, or “X”) written in the LSB at a fixed
ET. However, LSBs can be written with probabilistic write for AC applications, thus
optimizing the write energy-latency performance. Moreover, by writing Logic
“X” using probabilistic write in the LSBs, we achieved higher cell performance than when
writing Logic “0”→“1”/“1”→“0”, as illustrated in Table 6.8. Figure 6.13 shows the overall
improvement in average write energy per bit and latency, by factors of 2.83x and 1.99x,
respectively. These improvements are achieved using probabilistic write of “X” in LSBs
(for ET = 3%) at a write voltage of 1 V, compared to Write “1” (for ET = 0.1%) at 1.2 V
(equal to VDD). Simultaneously, it also improves the search and write accuracy in terms of
distance error (DE). The DE is defined as the difference (in decimal value) between the
targeted write number and the number actually written using probabilistic write. We
observed a trade-off between write energy per bit and DE.
Tables 6.9 and 6.10 show the DE obtained when the probabilistic write scheme (for
ET = 3%) is used for 3 bits (LSB) while writing Logic “1” and “X”,
respectively. The worst-case DE of 0.21 is observed when the 3 LSBs are written from
Logic state “000” to “111”. The worst-case DE with Write “X” (0.105) is half of the
worst-case DE observed for Write “1”. Thus, Write “X” gives a better trade-off
between DE and write energy per bit than Write “1”. This DE, or inaccuracy relative
to the targeted number, is tolerated by exploiting the ET of AC applications.
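The Write “1” distance errors can be reproduced with a simple expected-value calculation. The sketch assumes that each targeted “1” bit is written independently from an initial “0” with success probability 1 − ET = 0.97 and that a failed write leaves the bit at “0”; the source does not state that Table 6.9 was computed this way, so this is one consistent reading rather than the authors' procedure.

```python
def expected_written_value(target_bits, write_success_prob):
    """Expected decimal value after probabilistically writing target_bits
    onto an all-zero initial state (a failed write leaves the bit at 0)."""
    n = len(target_bits)
    expected = 0.0
    for position, bit in enumerate(target_bits):
        weight = 2 ** (n - 1 - position)  # binary weight, MSB first
        if bit == "1":
            expected += weight * write_success_prob
    return expected


def distance_error(target_bits, write_success_prob):
    """Difference between the targeted number and the expected written number."""
    return int(target_bits, 2) - expected_written_value(target_bits, write_success_prob)


ET = 0.03  # targeted cell-level error tolerance of 3%
for value in range(1, 8):
    bits = format(value, "03b")
    avg = expected_written_value(bits, 1.0 - ET)
    de = distance_error(bits, 1.0 - ET)
    print(f"{bits}/{value}: average written {avg:.2f}, DE {de:.2f}")
```

Under this reading the DE equals ET times the targeted decimal value, which gives the worst case of 0.21 for “111”. The Write “X” values in Table 6.10 are exactly half of these; the sketch does not attempt to model where that factor comes from.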
Table 6.9 Distance error (DE) due to probabilistic Write “0”→“1” of 3 LSBs for ET = 3% (1000
write cycles)
Targeted 3-LSB write operation   Average written number   Distance error (DE)
(Binary/Decimal)                 (Initial state: 000)     from targeted number
001/1                            0.970                    0.03
010/2                            1.94                     0.06
011/3                            2.91                     0.09
100/4                            3.88                     0.12
101/5                            4.85                     0.15
110/6                            5.82                     0.18
111/7                            6.79                     0.21
Table 6.10 Distance error (DE) due to probabilistic Write “0”→“X” of 3 LSBs for ET = 3% (1000
write cycles)
Initial state of 3 LSB (Binary)   Targeted 3-LSB write   Distance error from targeted write
000                               00X                    0.015
000                               0X0                    0.03
000                               0XX                    0.045
000                               X00                    0.06
000                               X0X                    0.075
000                               XX0                    0.09
000                               XXX                    0.105
6.6 Conclusion
In this chapter, a fast and energy-efficient write scheme for hybrid ReTCAM is
presented that can be used in AC paradigms. The presented scheme exploits
the MTJ’s intrinsic stochastic switching, coupled with the merit of writing the Logic
don’t-care state, for different levels of ET (i.e., 0.1, 0.3, and 3%). A device-level
correlation with the optimum write voltage for overall TCAM performance
was discussed in detail. Linking the average switching time and the skewness of the
switching probability of the MTJ device to the ReTCAM cell ensured optimum programming.
The average write energy and write latency per bit improved by factors of 2.83x and
1.99x, respectively, while keeping a 3% ET. Worst-case write inaccuracy for
probabilistic write of LSBs is also decreased by a factor of 2x by using the
don’t care-based scheme. In an AC paradigm using an n-bit ReTCAM, the LSBs
can be written with stochastic write conditions to significantly improve the write
performance without a significant drop in match accuracy.
7.1 Introduction
Resistive switching memory has two distinct resistance states, a high resistance
state (HRS) and a low resistance state (LRS), which represent the binary “0” and “1”
states [1, 2]. Ideally, once written, the state is maintained unless a voltage larger
than the switching threshold is applied. One of the great advantages of resistive
switching memory is its scalability in comparison with capacitance-based
memory such as dynamic random access memory (RAM). That is, scaling
down resistive memory cells does not significantly complicate the cell
design or its fabrication process, unlike the capacitors
of dynamic RAM. A crossbar array (CBA) of such resistive switching memory
cells with selectors enables access to an arbitrary cell, realizing nonvolatile RAM.
This nonvolatile RAM is popularly referred to as resistive RAM (ReRAM for
short).
D. S. Jeong (B)
Division of Materials Science and Engineering, Hanyang University, Wangsimni-ro 222,
Seongdong-ku, 04763 Seoul, Republic of Korea
e-mail: [email protected]
Three conceivable strategies for ReRAM architecture are (i) a CBA of cells with
active selectors (transistors), (ii) a CBA with passive selectors (e.g., diodes and ovonic
threshold switches [3–5]), and (iii) a CBA without selectors. Hereafter, the first two
are termed active CBA and passive CBA, respectively. A fascinating attribute of
CBAs is the inherent occurrence of an analog multiply-accumulate (MAC) operation
during the simultaneous application of reading voltages to the lines. The operation
is fully parallel such that the minimum complexity of the MAC operation, O(1), is ideally
achieved. Moreover, with real-valued conductance of each cell in the array, a matrix
with real-valued entries and its multiplication can be realized as follows:
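$$
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_N \end{pmatrix}
=
\begin{pmatrix}
g_{11} & g_{12} & \cdots & g_{1M} \\
g_{21} & g_{22} & \cdots & g_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
g_{N1} & g_{N2} & \cdots & g_{NM}
\end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_M \end{pmatrix},
\tag{7.1}
$$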
where v_m, i_n, and g_nm denote the voltage applied to the mth column, the resulting current
measured at the nth row, and the conductance of the cell between the mth column and
nth row, respectively. In this case, g_nm takes a real value. Yet, the sneak current due
to the lack of selectors likely hinders correct operation, given the high line resistance
that results from scaling down. The passive CBA also enables analog MAC
operation with the minimum complexity O(1). A remarkable advantage is that the
deployed passive selectors suppress sneak current so that the reliability of MAC
operation is likely improved relative to the selector-free CBA. The selectors barely add
footprint as long as each selector and memory cell are stacked. Yet, the
use of selectors may not be compatible with memory of real-valued conductance
given the contribution of the selector resistance to the total conductance. Instead, the
passive CBA is most compatible with binary resistive memory. The active CBA offers
an easier solution to the sneak current problem and compatibility with memory of
real-valued conductance. However, a higher operation complexity [>O(1)] and the large
footprint of the transistors undermine the efficiency of MAC operation.
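As a minimal illustration of the MAC operation in (7.1), the following sketch computes the row currents of an idealized crossbar as a matrix-vector product. It assumes no sneak current, no line resistance, and ideal selectors; the function name and the conductance and voltage values are hypothetical.

```python
import numpy as np

def crossbar_mac(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Ideal crossbar read following (7.1): i[n] = sum over m of g[n, m] * v[m]."""
    return g @ v

# Hypothetical 2 x 3 array: conductances in siemens, read voltages in volts.
g = np.array([[1e-6, 2e-6, 0.0],
              [5e-7, 0.0, 3e-6]])
v = np.array([0.2, 0.1, 0.3])

i = crossbar_mac(g, v)
print(i)  # row currents in amperes

# The same result from an explicit per-cell sum (Ohm's law per cell,
# Kirchhoff's current law per row).
i_loop = np.array([sum(g[n, m] * v[m] for m in range(g.shape[1]))
                   for n in range(g.shape[0])])
```

In hardware all cell currents are summed simultaneously on each row line, which is why the text describes the operation as O(1); the software product above only mirrors the arithmetic.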
Such MAC operation is the heart of machine learning based on artificial neural
networks. An artificial neural network is a nonlinear hypothesis whose nonlinearity
arises from the activation functions that are referred to as neurons. Other than these
nonlinear activation functions, the whole calculation is linear in that the input into
neuron A is merely the weighted sum of outputs from the neurons in contact with
neuron A. This relation for the simple network in Fig. 7.1 is described by
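$$
\begin{pmatrix} z_1^{L} \\ z_2^{L} \\ \vdots \\ z_N^{L} \end{pmatrix}
=
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1M} \\
w_{21} & w_{22} & \cdots & w_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{NM}
\end{pmatrix}
\begin{pmatrix} a_1^{L-1} \\ a_2^{L-1} \\ \vdots \\ a_M^{L-1} \end{pmatrix},
\tag{7.2}
$$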
7 Greedy Edge-Wise Training of Resistive Switch Arrays 179
Fig. 7.1 Toy neural network. The circles denote activation functions (neurons)
where a_m^(L−1), z_n^L, and w_nm are the output of neuron m in layer L − 1, the input to neuron n
in layer L, and the connection weight between neuron m and neuron n, respectively.
Note that the bias array is omitted for simplicity. The similarity to (7.1) is easily noticed,
which makes it possible to apply any type of CBA to neural network calculation,
potentially offering energy and time efficiency.
Section 7.1.1 addresses the general framework of learning in machine
learning based on artificial neural networks. Section 7.1.2 addresses general
strategies for training CBAs for supervised learning. Section 7.2 addresses a recently
proposed greedy edge-wise training method suitable for on-chip learning.
Fig. 7.2 a Schematic of a feed-forward neural network including hidden layers. b A schematic of
an RBM
flow to the output layer. The feed-forward neural network varies in architecture,
including the number of layers and the connection configuration. The simplest
architecture is arguably a single-layer neural network (perceptron) that consists merely of
input and output layers. The N neurons in the output layer are fully connected to
the M input neurons. Therefore, this network involves a single N × M weight
matrix. A multi-layer perceptron includes hidden layers between the input and output
layers, which improve classification accuracy by producing nonlinear decision
boundaries [6]. Each additional hidden layer needs one additional weight matrix. For
instance, a feed-forward neural network including L hidden layers involves L + 1
weight matrices, which increases the workload. The convolutional neural network
(CNN) is another type of feed-forward neural network that has sparse and localized
connections as opposed to the perceptron [6, 7]. Such neural networks with hidden
layers are classified as deep neural networks (DNNs).
Generative learning can capture the structure of input data, unlike discriminative
learning [8]. However, generative learning by itself cannot endow a model with a
classification function. Generative learning enables input data to be mapped onto a new
data space with different bases from the input space, which contrasts features of one
“implicit” class with the others. The network architecture for generative learning
obviously differs from the feed-forward neural network. The restricted Boltzmann
machine (RBM) is a typical example of a network architecture suited to generative
learning [9]. An RBM is a probabilistic neural network that consists of visible and hidden
layers as illustrated in Fig. 7.2b. Each neuron in both layers serves as a feature or a
dimension. Akin to the feed-forward neural network, the RBM takes weights and biases
as model parameters. Input data are mapped onto the output layer depending upon
the model parameters so that the input data are described by different features
(neurons) in the output layer. If the output layer includes fewer neurons than the input, the
mapping implies a reduction in dimension. This case is referred to as dimensionality
reduction. Once input data are mapped onto the hidden layer through the weight
Fig. 7.3 a Schematic of a feed-forward neural network including hidden layers. b The procedures
of inference and backpropagation (training) for the feed-forward neural network
matrix and bias array, the data can be remapped onto the visible layer to recover
the original input data (autoencoding). An RBM is trained so as to increase
the equivalence between the original input data and the recovered (reconstructed) data.
This also enhances the equivalence between arbitrary input data and those in the
hidden layer. The RBM, therefore, needs bidirectional data flow through the same
edges (connections), which obviously differs from the feed-forward neural network.
Though a unit RBM consists of merely two layers (Fig. 7.2b), multiple unit RBMs
can be stacked for repeated changes in dimension through the unit RBMs. Such a
network is referred to as a deep belief network (DBN) [10]. A DBN is trained
in a greedy layer-wise manner in that the first RBM unit from the input visible layer
is first fully trained and the following units are subsequently trained until the last
RBM unit [10].
output layer, and consequently, matrix wL+1 and bias array bL+1 are first optimized.
Subsequently, the cost for the Lth hidden layer (i.e., the difference between the
desired and actual outputs of the Lth hidden layer) is evaluated to modify matrix wL
and bias array bL . The desired output of the hidden layer is acquired from the cost
of the output layer so that the error propagates from the output to the lower layer.
This parameter update continues until w1 and bias array b1 . That is, the sequence
of parameter updates is from the output to input layer as opposed to inference.
This training process is termed backpropagation. Schematics of backpropagation
and inference are illustrated in Fig. 7.3.
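The update order described above (output layer first, then the error propagated toward the input) can be sketched for a network with a single hidden layer. This is a generic textbook backpropagation step, assuming sigmoid activations, a squared-error cost, and plain gradient descent; the layer sizes, learning rate, and data below are hypothetical and not taken from the chapter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(params, x, y, lr=0.5):
    """One backpropagation step for a one-hidden-layer network."""
    w1, b1, w2, b2 = params
    # Inference: input -> hidden -> output.
    a1 = sigmoid(w1 @ x + b1)
    a2 = sigmoid(w2 @ a1 + b2)
    cost = 0.5 * np.sum((a2 - y) ** 2)
    # The output-layer error is evaluated first ...
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # ... and then propagated back to the hidden layer through w2.
    delta1 = (w2.T @ delta2) * a1 * (1.0 - a1)
    new_params = (w1 - lr * np.outer(delta1, x), b1 - lr * delta1,
                  w2 - lr * np.outer(delta2, a1), b2 - lr * delta2)
    return new_params, cost

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), np.zeros(4),
          rng.normal(size=(2, 4)), np.zeros(2))
x, y = np.array([0.2, 0.7, 0.1]), np.array([1.0, 0.0])

costs = []
for _ in range(200):
    params, cost = train_step(params, x, y)
    costs.append(cost)
print(costs[0], costs[-1])  # the cost should shrink over the iterations
```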
Backpropagation or a modified (often simplified) version of it is often used to train
a CBA [11–14]. It is often assumed that the CBA represents real-valued conductance.
A simple way (the delta rule) is to first evaluate the error from the output layer [11–13].
This error determines the sign of a weight (conductance) change. A more complicated way
is to also evaluate the error from the output layer and accordingly program the
desired conductance in an iterative manner with conductance verification [11, 14].
As such, this algorithm requires error-evaluation and write-evaluation circuits so that
the consequent circuit overhead may outweigh the benefits of the efficient MAC
operation.
The Markov chain Hebbian learning (MCHL) algorithm [15] opens the way for training
a CBA for supervised learning without a cost function. Instead of optimizing
the whole model by looking up the error, the MCHL algorithm lets a local learning
rule defined between a pair of neurons eventually optimize the model parameters
as a whole. Because each edge between a pair of neurons is updated without a global
function for the whole network, the MCHL is classified as a greedy edge-wise training
algorithm. That is, adjusting each edge is believed to lead the energy of the entire
system to its minimum. Another significant feature of the MCHL algorithm is the
use of ternary weights w[i, j] ∈ {−1, 0, 1} not only for inference but also for training.
This distinguishes the algorithm from binarizing real-valued weights at each update
step [16] as well as from the use of auxiliary real-valued variables [17]. Other important
features are as follows:
(i) each weight is updated in a probabilistic manner,
(ii) given the finite states of weight and probabilistic update among the states, the
update process follows a finite-state Markov chain,
(iii) a group of neurons associatively represent a particular label, which is in line
with a concept cell [18],
(iv) a deep network is trained in a greedy layer-wise fashion.
To date, stochastic update of binary weight has been addressed in the framework
of stochastic Hebbian learning that accounts for the long-term potentiation
A unit network for the MCHL algorithm is analogous to an RBM. The unit network
has two layers of binary stochastic neurons without recurrent connections. The
main difference is that this unit network is a feed-forward network, so that the hidden
layer in the RBM is replaced by the output layer. Unlike in the RBM, this output layer
does not feed input back into the input layer. A schematic of a unit network including
M input features and N output neurons is illustrated in Fig. 7.4a. u1 and u2 denote the
input and output arrays, respectively, which are defined as
$$\mathbf{u}_1 \in \mathbb{R}^{M}, \quad 0 \le u_1[i] \le 1,$$
$$\mathbf{u}_2 \in \mathbb{Z}^{N}, \quad u_2[i] \in \{0, 1\}.$$
H neurons in the output layer are grouped to associatively represent each
of the L labels, so that N is equal to LH. Hereafter, such a group is referred to as a
bucket. When the L labels are indexed from 1 to L, u2[(n − 1)H + 1:nH] is the block
of output activities for the nth label. Note that x[a:b] denotes a block ranging from
the ath to the bth element of vector x.
Weight matrix w is, therefore, an N × M matrix that defines the strength in
connectivity between a pair of neurons. As such, each entry of w takes one of the
ternary values, −1, 0, and 1. According to the bucket configuration in the output
layer, the weight matrix is also partitioned such that w[(n − 1)H + 1: nH, 1: M] is
for the connection from the input vector to the output neurons of the nth label.
The energy E of the network is given by
$$E(\mathbf{u}_1, \mathbf{u}_2) = -\left(2\mathbf{u}_2 - \vec{1}\right)^{T} \cdot \mathbf{w} \cdot \mathbf{u}_1 + \mathbf{b}^{T} \cdot \mathbf{u}_2, \tag{7.3}$$
where $\vec{1}$ is an N-long vector filled with ones. b denotes a bias vector for the output
layer. $2\mathbf{u}_2 - \vec{1}$ in (7.3) transforms u2 such that a quiet neuron (u2[i] = 0) is given
an output of −1 rather than zero. This counts the cost of a positive connection (w[i,
j] = 1) between a nonzero input (u1[j] ≠ 0) and output neuron in an undesired label
(u2[i] = 0) in supervised learning. This undesired connection raises the energy by
u1[j]. The joint probability distribution of u1 and u2 is
$$P(\mathbf{u}_1, \mathbf{u}_2) = e^{-E(\mathbf{u}_1, \mathbf{u}_2)/\tau} / Z,$$
where Z is the partition function of the network,
$$Z = \sum_{j=1}^{M} \sum_{i=1}^{N} e^{-E(u_1[j],\, u_2[i])/\tau}.$$
τ
184 D. S. Jeong
Fig. 7.4 a Basic network of M input and N output binary stochastic neurons (u1 and u2: their
activity vectors). b Behavior of P(u2[i] = 1) with z[i] when a[i] = 0. c Graphical description of
the weight matrix w (w ∈ Z^{N×M}; w[i, j] ∈ {−1, 0, 1}) that determines the correlation between
the input activity u1 (u1 ∈ R^M; 0 ≤ u1[i] ≤ 1) and output activity u2 (u2 ∈ Z^N; u2[i] ∈ {0, 1}).
This weight matrix w evolves in accordance with given pairs of an input u1 and write vector
v (v ∈ Z^N; v[i] ∈ {−1, 1}), ascertaining the statistical correlation between u1 and v by following
the sub-updates. d Potentiation: a weight component at the current step t (w_t[i, j]) has a nonzero
probability to gain +1 (i.e., Δw_t[i, j] = 1) only if u1[j] ≠ 0, v[i] = 1, and w_t[i, j] ≠ 1; for instance,
given u1 = (0, 1, 0, …, 0) and v = (1, −1, −1, …, −1), w_t[1, 2] has a probability of positive update.
e Depression: all components w_t[i, 2] (i ≠ 1) are probabilistically subject to negative update (gain
−1) insofar as u1[2] = 1, v[i] = −1, and w_t[i, 2] ≠ −1
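$$
P(\mathbf{u}_2 | \mathbf{u}_1) =
\frac{\prod_{i=1}^{N} e^{\left(-a[i]u_2[i] - \sum_{j=1}^{M} w[i,j]u_1[j] + 2\sum_{j=1}^{M} u_2[i]w[i,j]u_1[j]\right)/\tau}}
{\prod_{i=1}^{N} \sum_{u_2[i] \in \{0,1\}} e^{\left(-a[i]u_2[i] - \sum_{j=1}^{M} w[i,j]u_1[j] + 2\sum_{j=1}^{M} u_2[i]w[i,j]u_1[j]\right)/\tau}}
$$
$$
P(\mathbf{u}_2 | \mathbf{u}_1) = \prod_{i=1}^{N} P(u_2[i] | \mathbf{u}_1).
$$
$$
P(u_2[i] = 1 | z[i]) = \frac{e^{(-a[i] + 2z[i])/\tau}}{1 + e^{(-a[i] + 2z[i])/\tau}} = \frac{1}{1 + e^{(a[i] - 2z[i])/\tau}}. \tag{7.6}
$$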
This probability function for the binary stochastic neuron is plotted in Fig. 7.4b.
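The relation between the energy (7.3) and the neuron probability (7.6) can be checked numerically with a short sketch. It assumes that a[i] in (7.6) plays the role of the bias b[i] in (7.3) and that z = w·u1; the toy weight matrix, input, and bias values are hypothetical.

```python
import numpy as np

def energy(u1, u2, w, b):
    """Network energy following (7.3): E = -(2*u2 - 1)^T w u1 + b^T u2."""
    return -(2 * u2 - 1) @ (w @ u1) + b @ u2

def p_active(z, a, tau=1.0):
    """Probability that u2[i] = 1 following (7.6)."""
    return 1.0 / (1.0 + np.exp((a - 2.0 * z) / tau))

# Hypothetical toy network: M = 3 inputs, N = 2 outputs, ternary weights.
w = np.array([[1, 0, -1],
              [0, 1, 1]])
u1 = np.array([0.5, 1.0, 0.25])
b = np.array([0.1, -0.2])
tau = 1.0
z = w @ u1

# Flip neuron 0 from quiet to active while keeping neuron 1 quiet, and
# compare the Boltzmann weights exp(-E/tau) of the two configurations.
e_quiet = energy(u1, np.array([0, 0]), w, b)
e_active = energy(u1, np.array([1, 0]), w, b)
p_from_energy = np.exp(-e_active / tau) / (
    np.exp(-e_active / tau) + np.exp(-e_quiet / tau))

print(p_from_energy, p_active(z[0], b[0], tau))  # the two values should agree
```

With z[i] = 0 and a[i] = 0 the function returns 0.5, the midpoint of the sigmoid-like curve the text refers to in Fig. 7.4b.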
In the MCHL algorithm, supervision is realized by applying a field that directs an input
pattern to its desired label. Directing the input is implemented by (a) encouraging its
connection with the output neuron(s) of the desired label among the L labels and (b)
discouraging it otherwise, both in a probabilistic manner. To this end, a write vector
v that points to the correct label in the L-dimensional space is essential. Each label
is given a bucket of H neurons so that v is an LH-long vector. Given that all labels
are orthogonal to each other, each bucket of v, i.e., v[(n − 1)H + 1:nH] with 1 ≤
n ≤ L, offers one basis of the applied field in the L-dimensional space. v[a:b]
denotes a block ranging from the ath to the bth element. Only one element in each label
bucket of v is randomly chosen for each ad hoc update and given a nonzero value:
the element dedicated to the desired label is set to 1 while the other L − 1
elements are set to −1. This write vector v is renewed at every update. Therefore, the
update is sparse. It is noteworthy that v[i] ∈ {−1, 1} when H = 1 and v[i] ∈ {−1, 0, 1}
otherwise.
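A minimal sketch of how such a write vector could be built is given below. It follows the description in this chapter, in which the same position r is used in every bucket (as stated for the MNIST example); the function name, the 0-based label indexing, and the use of NumPy's random generator are assumptions made for illustration.

```python
import numpy as np

def make_write_vector(label, num_labels, bucket_size, rng):
    """Build a sparse write vector v of length num_labels * bucket_size.

    One position r is drawn at random. The r-th element of the desired
    label's bucket is set to +1, the r-th element of every other bucket
    to -1, and all remaining elements stay 0. Labels are 0-indexed here.
    """
    v = np.zeros(num_labels * bucket_size, dtype=int)
    r = rng.integers(bucket_size)          # random position within a bucket
    v[r::bucket_size] = -1                 # r-th element of every bucket
    v[label * bucket_size + r] = 1         # overwrite the desired label
    return v

rng = np.random.default_rng(0)
v = make_write_vector(label=3, num_labels=10, bucket_size=5, rng=rng)
print(v.reshape(10, 5))  # one row per label bucket
```

With bucket_size = 1 every element is nonzero, which matches the remark that v[i] ∈ {−1, 1} when H = 1.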
Figure 7.4c graphically describes the feed-forward connection between u1 and
u2 for the topology in Fig. 7.4a. The matrix w is loaded with ternary elements
(w ∈ Z^{N×M}; w[i, j] ∈ {−1, 0, 1}) and N = LH. Here, the input vector is u1 ∈
R^M; 0 ≤ u1[i] ≤ 1. Consequently, v ∈ Z^N; v[i] ∈ {−1, 0, 1}. According to the
bucket configuration of the write vector v, the matrix w is partitioned such that w[(n
− 1)H + 1:nH, 1:M] defines the correlation between the input and its label (n).
Likewise, z (= wu1) is also partitioned into L buckets in the same order as v, and
the same holds for the output activity vector u2.
Every pair of u1 and v stochastically updates each component w[i, j] in w by
Δw[i, j] = w_{t+1}[i, j] − w_t[i, j] ∈ {−1, 0, 1}. The variables determining Δw[i, j]
include (a) u1[j] and v[i], (b) the current value of w_t[i, j], and (c) the output activity u2[i], as
follows (also see Table 7.1).
Condition (a): it is probable that Δw[i, j] = 1 when u1[j]v[i] > 0 (i.e., u1[j] ≠ 0
and v[i] = 1) and Δw[i, j] = −1 when u1[j]v[i] < 0 (i.e., u1[j] ≠ 0 and v[i] = −1),
conditional upon (b) and (c). That is, w[i, j] is updated to connect the nonzero u1[j]
and the ith output neuron in the desired label (when v[i] = 1) and to disconnect them when v[i]
= −1. The former and latter updates are referred to as potentiation and depression,
respectively (Fig. 7.4d, e). This condition is reminiscent of Hebbian learning in
that Δw[i, j] is determined by u1[j]v[i]. The larger the input u1[j], the more likely the
update is to succeed, in that both P+ (potentiation probability) and P− (depression
probability) scale with u1[j]: P+ = u1[j]P+0 and P− = u1[j]P−0, where P+0 and P−0
denote the maximum probabilities of potentiation and depression, respectively. Such a
negative update is equivalent to homosynaptic long-term depression in the biological
neural network, elucidated by the Bienenstock–Cooper–Munro theory supporting
spontaneous selectivity development [20, 21].
Condition (b): The updates Δw[i, j] = 1 and Δw[i, j] = −1 given Condition (a)
are allowed if the current weight is not 1 (w_t[i, j] ≠ 1) and not −1 (w_t[i, j] ≠ −1),
respectively. This condition keeps w[i, j] ∈ {−1, 0, 1} so that the update falls into a
finite-state Markov chain.
Condition (c): Alongside Conditions (a) and (b), the updates Δw[i, j] = 1 and
Δw[i, j] = −1 require u2[i] = 0 and u2[i] = 1, respectively. That is, a quiet output
neuron (u2[i] = 0) supports Δw[i, j] = 1, whereas an active one (u2[i] = 1) supports
Δw[i, j] = −1.
As a consequence of these update conditions, the MCHL algorithm spontaneously
captures the correlation between input and write vectors (u1 and v) during repeated
Markov processes, which is exemplified in Supplementary Information for randomly
generated input and write vectors that have a statistical correlation.
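Conditions (a) to (c) can be combined into one vectorized update step, sketched below. This is an illustrative reading of the three conditions only: the output activity u2 is passed in as an argument rather than sampled, the full sequence in Table 7.1 is not reproduced, and the function and parameter names are hypothetical.

```python
import numpy as np

def mchl_update(w, u1, v, u2, p_plus0, p_minus0, rng):
    """One stochastic ternary update of w (shape N x M).

    Potentiation (+1): v[i] == 1, u2[i] == 0, w[i, j] != 1,
                       accepted with probability u1[j] * p_plus0.
    Depression  (-1): v[i] == -1, u2[i] == 1, w[i, j] != -1,
                       accepted with probability u1[j] * p_minus0.
    """
    w = w.copy()
    rand = rng.random(w.shape)            # one uniform draw per weight
    drive = u1[np.newaxis, :]             # u1[j] broadcast over rows
    pot = ((v[:, None] == 1) & (u2[:, None] == 0)
           & (w != 1) & (rand < p_plus0 * drive))
    dep = ((v[:, None] == -1) & (u2[:, None] == 1)
           & (w != -1) & (rand < p_minus0 * drive))
    w[pot] += 1
    w[dep] -= 1
    return w

rng = np.random.default_rng(0)
w = np.zeros((2, 3), dtype=int)
u1 = np.array([0.0, 1.0, 0.0])   # only input 2 is nonzero
v = np.array([1, -1])            # row 1 is the desired label
u2 = np.array([0, 1])            # row 1 quiet, row 2 active
for _ in range(50):
    w = mchl_update(w, u1, v, u2, p_plus0=0.1, p_minus0=0.1, rng=rng)
print(w)
```

Because a zero input gives an acceptance probability of zero and Condition (b) blocks steps beyond ±1, the weights stay in {−1, 0, 1} and columns with u1[j] = 0 are never touched.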
As such, a learning rate is of significant concern for successful learning; a proper
rate that allows the matrix to converge to the optimized one should be chosen. The
same holds for the MCHL algorithm. The rate in the proposed algorithm is dictated
by P+0 and P−0 in place of an explicit rate term. For extreme cases such as P+0 = 1
and P−0 = 1, the matrix barely converges, but constantly fluctuates.
The MCHL algorithm can be applied to the handwritten digit recognition task with the
MNIST database (L = 10). Figure 7.5a shows the memory-centric network schematic
for the training, which encompasses one hidden layer. The weight matrices w1 and
w2 were trained in a greedy fashion in that w1 was first fully trained with input
vector u1 and write vector v1, which was then followed by training matrix w2 with
u2 and v2. The output vector a1 of the hidden neurons in response to each MNIST
dataset was taken as the input to matrix w2. The training protocol was the same for
both matrices. For each training epoch, a chosen input dataset (28 × 28 matrix) was
converted to a u1 vector of 784 elements: u1 ∈ R^784; 0 ≤ u1[i] ≤ 1. A bucket of H1
elements is assigned to each label in the v1 vector such that v1 is a 10H1-long vector
as illustrated in Fig. 7.5a. Every epoch with an input vector u1 randomly chooses one
of the H1 elements (the rth element) in the bucket of the correct label; the chosen element
in v1 is set to 1, the rth elements in the other buckets (9 in total) to −1, and the remaining
elements (10(H1 − 1) in total) to 0. Therefore, in matrix w1, the elements in only one row are
probabilistically subject to potentiation, those in 9 rows to depression, and the rest
are invariant. That is, the update is sparse. Accordingly, matrix w1 is partitioned into
10 sub-matrices (see Fig. 7.5a). The sequence of the MCHL algorithm application
is tabulated in Table 7.1.
Fig. 7.5 a Schematic of the network architecture for handwritten digit recognition. A single HL is
included. The matrix w1 first maps the input vector u1 to the hidden neurons. The probability that
u2[i] = 1 for all i’s is taken as an input vector to w2 that maps the input vector to the output neurons.
The write vector v1 has 10 (the number of labels) buckets, each of which has H1 elements, i.e.,
N = 10H1. Each thick arrow indicates an input vector to a group of neurons (each neuron takes
one element of the input vector). b Change in classification accuracy in the course of training with
network depth (H1 = 100, H2 = 50, H3 = 30). P+0, P−0, and τ were set to 0.1, 0.1, and 1, respectively
The eventual output of the entire network O is a vector of the outputs of all
labels (O ∈ Z^10); the output of each label O[i] is the activity integrated over the neurons
in label i,
$$O[i] = \sum_{j=1}^{H_2} a_2[(i - 1)H_2 + j]$$
(Fig. 7.5a). The location of the maximum
coefficient in the output vector designates the estimated label for a given input. The
recognition accuracy was evaluated with regard to agreement between the desired
and estimated labels.
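The readout described above amounts to a per-bucket sum followed by an argmax, as in the following sketch. The function name and the 0-based label indexing are assumptions; a2 is taken to be the vector of output-neuron activities ordered bucket by bucket.

```python
import numpy as np

def classify(a2, num_labels, bucket_size):
    """Sum the activities within each label bucket and pick the largest sum.

    Returns the per-label output vector O and the estimated label
    (0-indexed), i.e., the position of the maximum entry of O.
    """
    o = a2.reshape(num_labels, bucket_size).sum(axis=1)
    return o, int(np.argmax(o))

# Hypothetical example: 3 labels with 2 neurons each.
a2 = np.array([1, 0, 1, 1, 0, 0])
o, label = classify(a2, num_labels=3, bucket_size=2)
print(o, label)
```

In this example the second bucket collects two active neurons, so the estimated label is index 1.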
The weight matrix becomes larger with the bucket size, and so does the memory allocated
for the matrix. Nevertheless, the benefit of deploying buckets at the expense of
memory is twofold. First, a number of features (pixels) are shared among labels
so that an individual feature should not be directed exclusively to a single particular
label. The use of buckets allows such common features to be connected with elements
over different labels given the sparse update on matrix w. For instance, without such
buckets, each attempt to direct the feature at (1, 1), which belongs to both labels 1
and 2, to label 1 probabilistically weakens its connection with label 2; however,
the sparse update likely leaves its connection with the other neurons in label 2
invariant. This feature-sharing characteristic seemingly works against competition, and
thus against selectivity evolution. However, the use of buckets offers a solution to selectivity
evolution, which is the second benefit. As depicted in Fig. 7.5a, the 10 sub-matrices
in matrix w2 define 10 ensembles of H2 output neurons; the final output from each
label O[i] is the sum of the outputs over the neurons in the same label, i.e., the output
ranges from 0 to H2. As for the training in Fig. 7.1, a single training is
unable to capture the statistical correlation between the input and write vectors due
to a large error; however, the larger the number of trials, the less likely the statistical
error (noise) is to be incorporated into the data, in line with the error reduction in Monte
Carlo simulation with an increasing number of random samples. The use of buckets enables
the parallel acquisition of samples; therefore, it is conceivable that a larger bucket
size tends to improve recognition accuracy. However, as in Monte Carlo
simulation, the error reduction with sample number tends to become negligible when the
number is sufficiently large. Additionally, the memory cost may outweigh the
negligible improvement in accuracy. Therefore, it is of practical importance to
reconcile the performance with the memory cost.
The network depth substantially alters the recognition accuracy as plotted in
Fig. 7.5b. Without the hidden layer, the accuracy merely reaches approximately 88%
at H = 100, while deploying one HL improves the accuracy up to approximately
92% at H1 = 100 and H2 = 50. H1 and H2 denote the number of elements per bucket in the
write vectors for w1 and w2, respectively. The accuracy continues to improve
with more HLs (e.g., two HLs; blue curve in Fig. 7.5b), albeit only slightly compared with the
improvement from the first hidden layer.
7.3 Conclusion
References
1. D.S. Jeong, R. Thomas, R.S. Katiyar, J.F. Scott, H. Kohlstedt, A. Petraru, C.S. Hwang, Rep.
Prog. Phys. 75, 076502 (2012)
2. D.S. Jeong, K.M. Kim, S. Kim, B.J. Choi, C.S. Hwang, Adv. Electron. Mater. 2, 1600090 (2016)
3. S.R. Ovshinsky, Phys. Rev. Lett. 21, 1450–1453 (1968)
4. D.S. Jeong, H. Lim, G.-H. Park, C.S. Hwang, S. Lee, B.-K. Cheong, J. Appl. Phys. 111, 102807
(2012)
5. K. DerChang, S. Tang, I.V. Karpov, R. Dodge, B. Klehn, J.A. Kalb, J. Strand, A. Diaz, N.
Leung, J. Wu, S. Lee, T. Langtry, C. Kuo-wei, C. Papagianni, L. Jinwook, J. Hirst, S. Erra, E.
Flores, N. Righos, H. Castro, G. Spadini, A stackable cross point phase change memory, in IEEE
International Electron Devices Meeting (2009), pp. 1–4
6. Y. LeCun, Y. Bengio, G. Hinton, Nature 521, 436–444 (2015)
7. Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Proc. IEEE 86, 2278–2324 (1998)
8. D. Barber, Bayesian Reasoning and Machine Learning (Cambridge University Press,
Cambridge, United Kingdom, 2012)
9. G.E. Hinton, A practical guide to training restricted Boltzmann machines, in Neural Networks:
Tricks of the Trade, 2nd edn., ed. by G. Montavon, G.B. Orr, K.-R. Müller (Springer,
Berlin, Heidelberg, 2012), pp. 599–619. https://doi.org/10.1007/978-3-642-35289-8_32
10. G.E. Hinton, S. Osindero, Y.-W. Teh, Neural Comput. 18, 1527–1554 (2006)
11. P. Yao, H. Wu, B. Gao, S.B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.S.P.
Wong, H. Qian, Nat. Commun. 8, 15199 (2017)
12. M. Prezioso, F. Merrikh-Bayat, B.D. Hoskins, G.C. Adam, K.K. Likharev, D.B. Strukov, Nature
521, 61–64 (2015)
13. F. Alibart, E. Zamanidoost, D.B. Strukov, Nat. Commun. 4, 2072 (2013)
14. M. Hu, C.E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R.S. Williams,
J.J. Yang, Q. Xia, J.P. Strachan, Adv. Mater. 30, 1705914 (2018)
15. G. Kim, V. Kornijcuk, D. Kim, I. Kim, J. Kim, H.C. Woo, J.H. Kim, C.S. Hwang, D.S. Jeong,
arXiv:1711.08679 [cs.NE] (2017)
16. M. Courbariaux, Y. Bengio, J.-P. David, arXiv:1511.00363 (2015)
17. C. Baldassi, A. Braunstein, N. Brunel, R. Zecchina, Proc. Natl. Acad. Sci. 104, 11079–11084
(2007)
18. R.Q. Quiroga, Nat. Rev. Neurosci. 13, 587–597 (2012)
19. N. Brunel, F. Carusi, S. Fusi, Network: Computation in Neural Systems 9, 123–152 (1997)
20. E. Bienenstock, L. Cooper, P. Munro, J. Neurosci. 2, 32–48 (1982)
21. L.N. Cooper, M.F. Bear, Nat. Rev. Neurosci. 13, 798–810 (2012)
Chapter 8
mMPU—A Real Processing-in-Memory
Architecture to Combat the von
Neumann Bottleneck
Abstract Data transfer between processing and memory units is the main
performance and energy-efficiency bottleneck of modern computing systems, commonly
known as the von Neumann bottleneck. Prior attempts to alleviate the problem
by moving the computing units closer to the memory have had limited success
since data transfer is still required. In this chapter, we present the memristive
memory processing unit (mMPU), which relies on a memristive memory to perform
computation using the memory cells, and therefore directly tackles the von Neumann
bottleneck. In the mMPU, the operation is controlled by a modified controller and peripheral
circuit without changing the structure of the memory cells and arrays. As the basic
logic element, we present Memristor-Aided loGIC (MAGIC), a technique to compute
logical functions using memristors within the memory array. We further show
how to extend basic MAGIC primitives to execute any arbitrary Boolean function
and demonstrate the microarchitecture of the memory. This process is required to
N. Talati (B)
Computer Science and Engineering Department, University Of Michigan, Ann Arbor, Michigan
48105, USA
e-mail: [email protected]
R. Ben-Hur · N. Wald · S. Kvatinsky
Electrical Engineering Department, Technion - Israel Institute of Technology, 3200003 Haifa,
Israel
e-mail: [email protected]
N. Wald
e-mail: [email protected]
S. Kvatinsky
e-mail: [email protected]
A. Haj-Ali
Electrical Engineering and Computer Science Department, University of California, Berkeley,
94720 Berkeley, CA, USA
e-mail: [email protected]
J. Reuben
Lehrstuhl für Informatik 3, Department Informatik (INF), Friedrich-Alexander-University (FAU)
Erlangen-Nürnberg, 91058 Erlangen, Bayern, Germany
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2020 191
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_8
192 N. Talati et al.
enable data computing using MAGIC. Finally, we show how to build the computing
system using mMPU, which performs computation using MAGIC to enable a real
processing-in-memory machine.
8.1 Introduction
Fig. 8.1 a Abstract model of von Neumann architecture, where two separate units (CPU and
memory) are dedicated for data processing and data storage. These elements are connected through
a bandwidth(B/W)-limited bus for data transfer [35]. b Performance scaling of CPU and memory
with respect to time
8 mMPU—A Real Processing-in-Memory Architecture … 193
which suffer the most from the von Neumann bottleneck, can be efficiently executed
on the mMPU.
Early efforts in investigating PiM date back to the ’90s. Some famous proposals
include a configurable PiM chip that can operate as a conventional memory or as
a Single-Instruction Multiple-Data (SIMD) processor for data processing [12]. The
authors of Active pages [33] have proposed placing the CPU and configurable logic
elements next to the DRAM subarrays to speed up the processing. In Computational
RAM [11], the sense amplifiers of the random access memory are connected directly
to the SIMD pipelines. The Berkeley IRAM project [35, 36] advocated widening the
bandwidth between CPU and memory by designing them on the same die.
Early PiM proposals failed to gain widespread adoption because of four major
challenges [1]. The first challenge was inadequate implementation technology.
Although prior proposals tried to integrate the memory and CPU on the same die,
the incompatible fabrication technologies of DRAM and CPU made it difficult to
incorporate these approaches in practical computing systems. The second was the
need for a processor architecture that can use the high bandwidth enabled by proximity to
memory. Early PiM research relied on custom architectures, which demanded considerable
design effort and significant advances from the developer community. The third
challenge was the development of interfaces that allow PiM computing units as well
as external processing units to access memory. Early efforts required the design and
adoption of custom memory interfaces. The fourth challenge was the programming
model. Early approaches had to develop the programming abstractions from the
bottom up.
Today, these challenges are being overcome by advances in the technologies and
methodologies used to build computers.
For example, the first challenge has been overcome by the emergence of 3D die
stacking, which enables heterogeneous integration of logic and memory, and by emerging
memory technologies, which facilitate 3D fabrication of memory arrays on top of CMOS
substrates [16]. The evolution of various other processing platforms, e.g., GPGPUs,
custom accelerators, etc. have solved the second problem by efficiently utilizing
the high bandwidth offered by the memory within the thermal constraints of the
memory modules [10]. Recent die-stacked memory interface standards (such as High
Bandwidth Memory [22]) and off-chip memory interfaces that expose load-store
semantics (such as Hybrid Memory Cube [21]) meet nearly all the memory interface
requirements of PiM, which surmounts the third challenge. Recent frameworks such
as Heterogeneous System Architecture [2] and the associated software tools for
accelerators have addressed the fourth challenge to widespread adoption of PiM.
Although advances in technology solve most of the aforementioned
problems, the current state-of-the-art technologies and future PiM proposals should
address the new set of issues such as workload heterogeneity (different algorithms
Fig. 8.2 Modern PiM architecture: Micron's Automata Processor (AP) [9], which exploits the
inherent bit-parallelism in DRAM for symbolic pattern matching by performing multiple operations
on a single data item, thereby reducing the number of memory accesses. Labels in the figure: input
symbol, decoder, memory array with memory bits (Mb), State Transition Elements (STE 0 to y), state
transition logic, STE enable, state bit, clock, and output
present various memory layouts, access patterns, and involve computations with dif-
ferent degrees of parallelism and complexity) and fabrication challenges in memory
that can enable PiM.
One current state-of-the-art PiM concept is Micron’s Automata Processor (AP)
[9], as shown in Fig. 8.2. The AP natively implements the Nondeterministic Finite
Automata (NFA) paradigm in hardware. Thus, the AP is an accelerator designed
specifically for symbolic pattern matching. In this architecture, the input symbol
is provided to multiple memory arrays by decoding it, instead of the row address.
Automata operations are invoked through a routing matrix structure exploiting the
inherent bit-parallelism of traditional DRAM, enabling Multiple Instruction Sin-
gle Data (MISD) architecture. This architecture provides the flexibility to program
independent automata on a single silicon device [40]. Apart from the AP, several
other recent proposals for PiM enable the transition from DRAM to resistance-based
emerging Non-volatile Memory Technologies (NVRAM). These approaches include
the accelerators for enhancing artificial neural networks [3, 7], DDR3-compatible
interface with dual in-line memory modules (DIMM) capable of performing content
addressable searches [14], associative computing [15, 45], etc.
All of the previous approaches for addressing the von Neumann bottleneck using
PiM have relied on reducing the distance between the processing and the conven-
tional memory system, i.e., DRAM. Although DRAM has been exploited to its best
capabilities, these approaches still suffer from a fundamental problem—the need
to transfer data between the CPU and the memory. Because DRAM cells are inca-
pable of performing logical operations, systems with DRAM as a memory require a
separate resource to perform computation. Emerging memristive technologies, such
as Resistive Random Access Memory (RRAM or ReRAM) [27, 41, 42], enable a
new approach, where the computation of logical functions is done directly using the
196 N. Talati et al.
memory cells, without any need to instantiate additional CMOS blocks for process-
ing. In this chapter, the von Neumann bottleneck is solved by giving computational
capabilities directly to the memristive memory cells. Thus, the proposed approach
is fundamentally different than all the previously proposed techniques in PiM and
tackles the data movement issue directly.
In this section, we first describe the operation of the memristor crossbar array as
memory. Then, we present Memristor-Aided loGIC (MAGIC), a logic family that
enables the performance of logical operations within the memristive memory. We
further show how to integrate the MAGIC circuit within the memristive memory
array without requiring major modifications in the crossbar structure and techniques
to perform vector operations using MAGIC.
The memristor stores the logical value in terms of its resistance, in contrast to conven-
tional memories, which use a charge to represent data. This resistance is controlled
by applying voltage across the memristor. Memristors can be fabricated between
two metals, which act as the top and the bottom electrodes of a switching dielectric
material. Hence, memristors can be fabricated in the metal layers as part of a standard
CMOS Back End of Line (BEOL) process. Memristive memory generally utilizes
a crossbar structure, which enables an extremely dense memory array with a memory
cell area of 4F², where F is the technology feature size. Figure 8.3 shows one
such design of a memristive memory crossbar array. Voltage drivers, row/column
decoders, and sense amplifiers are used as a part of the peripheral circuit to sup-
port write and read operations, similar to other memory technologies. To perform a
write operation, a write voltage Vwrite, higher than the switching threshold voltage (von or voff,
which switch the memristor to the low-resistance state (LRS) and the high-resistance state (HRS),
respectively), is applied across the
target memristor through the wordlines and bitlines. For a memristor with asymmetric
switching characteristics (i.e., |von| ≠ |voff|), two different write voltages are applied
for writing logic 1 (i.e., VSET) and logic 0 (i.e., VRESET). Since during the write operation,
the voltage is applied through wordlines and bitlines, even the memristors adjacent
to the target memristors are partially influenced by this voltage, which may disturb
the state of the unselected memristor; this is known as the write disturb problem
[29]. Half-select voltages (typically Vwrite /2 or Vwrite /3 [6]) are applied to isolate the
nontarget memristors.
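The half-select idea can be sketched numerically. The following is a minimal illustration, assuming a Vwrite/2 biasing scheme, ideal wires, and a hypothetical write voltage of 2.0 V; the function name `cell_voltages` and all numeric values are assumptions made for this example and do not come from the cited works.

```python
def cell_voltages(rows, cols, sel_row, sel_col, v_write):
    """Voltage across every cell of a crossbar under a V/2 half-select scheme."""
    half = v_write / 2
    # The selected wordline is driven to v_write and the selected bitline to
    # ground; every unselected line is held at v_write / 2.
    wordlines = [v_write if r == sel_row else half for r in range(rows)]
    bitlines = [0.0 if c == sel_col else half for c in range(cols)]
    # The voltage across a cell is its wordline voltage minus its bitline voltage.
    return [[wordlines[r] - bitlines[c] for c in range(cols)] for r in range(rows)]


# Hypothetical 3 x 3 array, writing the cell at row 1, column 2.
voltages = cell_voltages(3, 3, sel_row=1, sel_col=2, v_write=2.0)
for row in voltages:
    print(row)
```

In this sketch only the target cell sees the full write voltage. Cells that share its wordline or bitline see half of it, which must stay below the switching threshold, and all remaining cells see no voltage at all.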
Read operations are performed by applying a voltage Vread , with a magnitude
lower than the threshold voltage for switching, and measuring the current passing
through the device using a sense amplifier (SA), as shown in Fig. 8.3. A primary
Fig. 8.3 Crossbar structure of memristive memory array. Voltage controllers and sense amplifiers
are used to perform read, write, and logic operations. Example of a write operation by applying
Vwrite across the target memristors, and a read operation by applying Vread across the memristor and
measuring the current using a sense amplifier. Note that read and write operations are performed
in a time-multiplexed fashion
challenge for the read operation for memristive memory is the sneak path current
phenomenon [4, 31, 38, 47], which is due to the resistive nature of the memory cells:
the read voltage also creates additional current paths, different than the desired path,
and this additional current flow adds resistance in parallel to the selected memristor,
which depends on the stored data in the unselected memristors. There are several
ways to overcome this challenge [4, 17, 47], including modification of the memory
cell structure (i.e., using a diode/transistor/selector in series with the memristor)
and using different biasing schemes for the unselected lines (i.e., ground/half-select
biasing schemes).
Although the memristive memory crossbar structure is symmetrical, accessing
memory cells in a conventional memristive memory array is possible only from one
direction. Access from the other direction is blocked since only specific voltages can
be applied in each row/column, and the decoding and sensing circuits are connected to
a single edge of the array. To enable the access to memory cells from all sides, voltage
controllers and sense amplifiers can be added on both sides of the memristive memory
crossbar, constituting a memory called transpose memory [39]. The additional peripheral
circuitry provides more flexibility to the memory array and additional
capabilities to the memory system. Figure 8.4a illustrates the difference in peripheral
circuitry between k × m conventional and transpose memory crossbars. Figure 8.4b
compares the ratio of the total area used in the CMOS layer to that used in the memristive layer
for different array sizes (i.e., k × k). The comparison shows that the ratio is
almost equal for the two memory types (implying similar area utilization) for large array sizes (i.e., k ≥ 100).
Note that this is a general comparison irrespective of the memristor technology used,
i.e., without considering the maximum allowed array size.
Fig. 8.4 a Comparison of additional supporting CMOS circuitry to facilitate logic implementa-
tion at nanocrossbar layer for k × m conventional and transpose memories, and b Ratio between
CMOS area (ACMOS ) and memristor area (AMEM ) for different array sizes (i.e., different k for k × k
arrays) for conventional and transpose memory crossbars. The area utilization at nanocrossbar layer
improves for larger arrays
All operations (read, write, and half-selecting cells) are performed in transpose
memory by application of similar voltages as in conventional memory, with the added
freedom of applying these voltages from both horizontal and vertical directions.
Furthermore, as described later in Sect. 8.3.2, transpose memory offers the additional
feature of transposing the logic execution in the columns of the array, whereas in
conventional memory, this is only possible over a memory row.
MAGIC is a stateful logic family [37] that is suitable for computation within the memristive
memory [26]. In MAGIC, n input memristors and a single output memristor
are used to execute n-input Boolean functions (e.g., NOR, NAND, OR, AND, and
NOT). Some MAGIC gates, such as NOR and NOT, can be implemented within the
memristive memory crossbar array without any modification of the crossbar or
the memory cells. An additional voltage level is required, apart from the read and write
voltages, to support MAGIC execution within the memory. Figure 8.5b
shows the schematic of a two-input MAGIC NOR gate, where IN1 and IN2 are the
inputs of the NOR gate, and OUT is the output. The input memristors and the output
memristor are always connected with opposite polarity, as shown in Fig. 8.5b.
To execute the MAGIC NOR operation, the output memristor is initialized to
RON (logic 1). A voltage V0, higher than the threshold voltage for switching, is applied to the
input memristors, and the output memristor is grounded from the other terminal as
shown in Fig. 8.5c. Due to the resistive nature of memristors, the voltage is divided
between the input and output memristors. Consequently, the output switches from
RON to ROFF only if at least one of the inputs is logic 1, i.e., when the voltage across the output
memristor is high. The value of the MAGIC execution voltage V0 has to be within a
certain interval to ensure that the MAGIC gate works as expected. The value of V0
should be high enough to switch the output memristor during the MAGIC execution
even when only one of the inputs is logic 1, which sets the lower bound on V0. Furthermore, the
value of V0 should be low enough that the output memristor does not switch when all the
inputs are logic 0 and that the input memristors never switch.
This sets the upper bound on V0. Hence, the constraints on the execution voltage V0 of an
n-input MAGIC NOR gate are
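$$\frac{v_{\mathrm{off}}}{R_{\mathrm{ON}}}\cdot\left(R_{\mathrm{ON}}+\frac{R_{\mathrm{OFF}}}{n-1}\,\Big\|\,R_{\mathrm{ON}}\right)<V_0, \tag{8.1}$$
$$V_0<\min\left[v_{\mathrm{off}}\cdot\left(1+\frac{R_{\mathrm{OFF}}}{nR_{\mathrm{ON}}}\right),\ |v_{\mathrm{on}}|\cdot\left(1+\frac{nR_{\mathrm{ON}}}{R_{\mathrm{OFF}}}\right)\right], \tag{8.2}$$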
which ensures that the gate executes a NOR operation and that the input data is never
destroyed. Note that these constraints neglect the
parasitic effects of wires. In a more realistic scenario, where a unit interconnect
resistance of r_w is considered between two adjacent wordlines/bitlines, (8.1) and (8.2) become
$$\frac{v_{\mathrm{off}}}{R'_{\mathrm{ON}}}\cdot\left(R'_{\mathrm{ON}}+\frac{R'_{\mathrm{OFF}}}{n-1}\,\Big\|\,R'_{\mathrm{ON}}\right)<V_0, \tag{8.3}$$
$$V_0<\min\left[v_{\mathrm{off}}\cdot\frac{\frac{R'_{\mathrm{OFF}}}{n}+R'_{\mathrm{ON}}}{R'_{\mathrm{ON}}},\ |v_{\mathrm{on}}|\cdot\frac{R'_{\mathrm{OFF}}+nR'_{\mathrm{ON}}}{R'_{\mathrm{OFF}}}\right], \tag{8.4}$$
where R'_ON and R'_OFF denote the effective resistances and are equal, respectively, to
(R_ON + i r_w) and (R_OFF + i r_w). Note that these expressions have the same form as (8.1) and (8.2).
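The constraints (8.1) and (8.2) can be checked numerically. The sketch below is only an illustration under stated assumptions: it neglects wire parasitics, models each memristor as an ideal two-state resistor, and uses hypothetical device values (RON = 1 kΩ, ROFF = 100 kΩ, voff = 0.3 V, von = −1.5 V) that are not taken from this chapter. The helper names `v0_bounds` and `magic_nor` are likewise invented for the example.

```python
def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def v0_bounds(n, r_on, r_off, v_off, v_on):
    """Return the (lower, upper) bounds on V0 from (8.1) and (8.2), for n >= 2."""
    # Lower bound: one input at R_ON, the other n - 1 inputs at R_OFF.
    lower = v_off / r_on * (r_on + parallel(r_off / (n - 1), r_on))
    # Upper bound: all inputs at R_OFF, output and inputs must not switch.
    upper = min(v_off * (1 + r_off / (n * r_on)),
                abs(v_on) * (1 + n * r_on / r_off))
    return lower, upper


def magic_nor(inputs, v0, r_on, r_off, v_off):
    """Evaluate one gate: inputs are 0/1 bits, output starts at logic 1 (R_ON)."""
    conductance = sum(1 / (r_on if bit else r_off) for bit in inputs)
    r_in = 1 / conductance
    # Voltage divider between the parallel inputs and the output memristor.
    v_out = v0 * r_on / (r_on + r_in)
    return 0 if v_out > v_off else 1


# Hypothetical device parameters, chosen only for illustration.
R_ON, R_OFF, V_OFF, V_ON = 1e3, 1e5, 0.3, -1.5
low, high = v0_bounds(2, R_ON, R_OFF, V_OFF, V_ON)
print("V0 must lie between", low, "and", high)
for a in (0, 1):
    for b in (0, 1):
        print(a, b, magic_nor((a, b), 1.0, R_ON, R_OFF, V_OFF))
```

With an execution voltage inside the computed interval, the output stays at logic 1 only when both inputs are logic 0, which is the NOR truth table.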
It is possible to further extend the execution of a MAGIC NOR operation from a
memory row to a memory column in the transpose memory [39]. Figure 8.6a shows
the schematic of a MAGIC NOR gate on a memory column. In this case, the MAGIC
execution voltage (V0 ) is applied to the output memristor, and the parallel combina-
tion of the input memristors is grounded from the side, which is not connected to
the output memristor. This is the only difference between them, and the range of V0
Fig. 8.6 a MAGIC NOR execution over a memristive memory column. b Attempt to execute two
distinct MAGIC NOR operations over the same row simultaneously, and c its equivalent circuit
schematic, demonstrating the wrong operation
Table 8.1 Steps involved in MAGIC NOR execution across a row (column) of a memristive memory
Step # | Operation | Application of voltages
1 | Initialize the output memristor to RON | out ← VWRITE
2 | Apply V0 to the input (output) memristor(s), and ground the output (input) memristor(s), for execution over a memory row (column) | in1, in2, ... ← V0 (GND) and out ← GND (V0)
is the same as in the previous case of NOR logic execution, which is nondestructive
in terms of its inputs. The steps involved in MAGIC execution over both rows and
columns are summarized in Table 8.1.
The parallelism of MAGIC within crossbar arrays is limited; two independent
MAGIC NOR gates cannot be executed simultaneously in the same row, as illustrated
in Fig. 8.6b. If V0 is applied to two different sets of input memristors ($\{IN_1^1, IN_2^1\}$ and
$\{IN_1^2, IN_2^2\}$), and the output memristors ($\{OUT^1, OUT^2\}$) are grounded, the equivalent
circuit becomes as shown in Fig. 8.6c. Due to the connection pattern between the
input and the output memristors, the two output memristors are actually connected in
parallel, leaving the equivalent resistance at the output as RON/2, rather than RON,
resulting in a wrong operation.
While the MAGIC execution voltages are applied to wordlines or bitlines (for trans-
pose MAGIC operation), the influence of these voltages is spread throughout the
whole data line, and not limited to the particular memory row/column. As shown in
Fig. 8.7, if V0 is applied to the first two columns, and the third column is grounded,
Fig. 8.7 a Intrinsic parallel MAGIC NOR execution over the data present in all the rows, and b
isolation of a row using an isolation voltage applied to that row (i.e., VISO) to prevent execution of
MAGIC NOR
each memristor in the first column performs the MAGIC NOR operation
with its neighboring cell in the second column and produces the output in the corresponding
cell in the third column. This behavior can be exploited to perform vector
operations [39]. Note that the latency to perform this vector operation is independent
of the size of the vector, as long as the entire vector can fit inside an array, and the
voltage drivers can provide the required currents for proper behavior.
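The row-parallel behavior described above can be mimicked with a purely behavioral model. The sketch below is an assumption-laden illustration: it treats each cell as an ideal bit, ignores all electrical effects, and uses a Python loop to stand in for what the crossbar does in a single step across all rows. The function name `vector_magic_nor` is hypothetical.

```python
def vector_magic_nor(array, in_cols, out_col):
    """Apply one MAGIC NOR step to every row of a bit array (modified in place)."""
    for row in array:
        # Step 1 of Table 8.1: initialise the output cell to R_ON (logic 1).
        row[out_col] = 1
        # Step 2: the output switches to logic 0 if any input cell is logic 1.
        if any(row[c] for c in in_cols):
            row[out_col] = 0
    return array


# Four rows holding all input combinations in columns 0 and 1.
memory = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
vector_magic_nor(memory, in_cols=(0, 1), out_col=2)
for row in memory:
    print(row)
```

In the hardware, the loop body happens for all rows at once, which is why the latency does not grow with the vector length as long as the vector fits in one array.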
If the vector operation is restricted to a few rows in the array, it is possible to
isolate a particular row from the MAGIC execution. This is achieved using isolation
voltages, which are similar to half-select voltages for write operations. While in
write operations, half of the voltage is applied (i.e., Vwrite/2) to prevent unwanted
switching, applying V0/2 in a MAGIC NOR operation would disturb the input
memristors. Hence, we propose ranges of voltages that can be applied to isolate
rows/columns, thus preventing them from executing a MAGIC NOR operation as
shown in Fig. 8.7b. When a MAGIC operation is performed over the rows, VISO must
fulfill
$$0<|V_{\mathrm{ISO}}|<|v_{\mathrm{off}}|<\frac{V_0}{2}, \tag{8.5}$$
and when a MAGIC operation is performed over columns, VISO should carry out
where von and voff are the SET and RESET switching thresholds for the memristor and
V0 is the MAGIC execution voltage. The voltage levels that should be supported by the
peripheral circuit in order to perform conventional memory operations and execute
MAGIC logic within the memristive memory are listed in Table 8.2. Figure 8.8 shows
the design of the peripheral circuit needed to support these operations and the voltage
levels inside the memristive memory. Analog multiplexers, as shown in Fig. 8.8b, can
Table 8.2 Voltage levels supported by the peripheral circuit to perform conventional memory
operations and execute MAGIC NOR gates within the memory
Operation | Voltages applied
Write | Vwrite = VSET and VRESET for writing logic 1 and 0, respectively
Read | Vread
Ground | GND
Half-select | Vwrite/2
MAGIC execution | V0
MAGIC isolation | VISO
[Fig. 8.8 labels: a memristive memory with on-chip controller, decoders, and sense amplifiers (SA) on the wordlines (WL) and bitlines (BL); b analog multiplexer selecting, for memory and logic operations, among the voltage levels V1 = VSET, V2 = VRESET, V3 = VREAD, V4 = GND, V5 = VSET/2, V6 = VRESET/2, V7 = Vctrl1, and V8 = Vctrl2 to drive a WL/BL]
be designed to assert different voltage levels to support write and MAGIC operations,
and a sense amplifier can be used to perform read operations.
While MAGIC NOR operations can be performed in every row (column) in parallel,
the length of the SIMD that can be implemented within a memristive crossbar is
restricted by the size of the array. The size of the array is further dependent on various
circuit and technological parameters. The circuit parameters crucial for deciding
the size of the array are the MAGIC execution voltage V0 , and the technological
parameters include memristive properties (RON , ROFF , von , and voff ) and parasitic
effects of the CMOS process (i.e., interconnect resistance and capacitance). To be
able to support MAGIC NOR operations in all the rows (columns) of the crossbar,
the MAGIC execution must be supported in the worst-case configuration at the row
(column) farthest from the voltage drivers, since the voltage across it would be the
lowest. Worst-case configuration occurs when all the resistance values in the array
are RON and it is required to execute MAGIC over all the rows (columns). This is
because lower memristor resistance would require higher current to be drawn from
the drivers, and as a consequence, the IR drop across the parasitic resistances would
be high, lowering the voltage drop across the farthest memristor. Hence, given fixed
V0 and other technological parameters, a finite number of MAGIC NOR operations
will be supported, which will limit the size of the memristive crossbar.
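The effect of the IR drop on the usable array size can be illustrated with a very simplified ladder model. The sketch below assumes that each row can be lumped into a single load resistance to an ideal ground, that adjacent rows are separated by a wire segment of resistance r_w, and that the driver sits at one end of the line. These are strong simplifications, and the function names and numeric values are hypothetical; the threshold `v_min` stands in for a lower bound such as the one in (8.1).

```python
def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def row_voltages(k, v0, r_w, r_load):
    """Node voltages of a k-row ladder: series wire r_w, shunt load r_load per row."""
    # Backward pass: resistance seen at each node, looking away from the driver.
    r_eq = [0.0] * k
    r_eq[-1] = r_load
    for i in range(k - 2, -1, -1):
        r_eq[i] = parallel(r_load, r_w + r_eq[i + 1])
    # Forward pass: repeated voltage division, starting from the driver.
    volts, v = [], v0
    for i in range(k):
        v = v * r_eq[i] / (r_w + r_eq[i])
        volts.append(v)
    return volts


def max_rows(v0, v_min, r_w, r_load, limit=1024):
    """Largest k (up to limit) whose farthest row still sees at least v_min."""
    best = 0
    for k in range(1, limit + 1):
        if row_voltages(k, v0, r_w, r_load)[-1] < v_min:
            break
        best = k
    return best


# Hypothetical values: 1 V drive, 0.6 V required, 1 ohm per segment, 1.5 kohm per row.
print(max_rows(v0=1.0, v_min=0.6, r_w=1.0, r_load=1500.0))
```

Lowering the load resistance, which corresponds to more cells in the RON state, or raising the wire resistance reduces the number of rows that still receive a sufficient voltage in this model.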
Furthermore, to support the execution of multiple MAGIC NOR operations in
parallel, the voltage drivers would require a large current inside the array, which has
two consequences. First, to supply a current large enough to support several MAGIC
NOR operations, the drivers must also be large, which will increase the area of the
chip. Second, since V0 has a higher voltage level than write voltage, performing
many MAGIC NOR operations in parallel will increase the energy consumption.
Hence, while the goal is parallel execution of MAGIC NOR gates, this parallelism
will be limited by the area and power budget of the chip from the point of view of
the peripheral circuit.
The primary difference between a memristive memory and an mMPU is their con-
trol mechanism. In addition to supporting regular memory operations (i.e., read
and write), the mMPU controller also handles logic operations within the memory,
and in practice its implementation determines the performance of the mMPU. We
now present the modifications that must be made to the on-chip controller of the
mMPU [18]. We further show SIMPLE MAGIC [20], an automatic synthesis tool
we have developed that receives any arbitrary Boolean function as input and proposes
an optimal (in terms of latency, energy, or area) sequence of MAGIC NOR gates to
implement that function using the mMPU.
The mMPU controller is responsible for generating the control signals for the memory
to perform read, write, and logical operations within the mMPU. As shown in Fig. 8.9,
the CPU sends the instruction to the mMPU controller. This instruction is received
by a CPU-in block, where it is decoded. Then, this instruction is broadcast to the
arithmetic, read, and write blocks, and a block suitable for the instruction type is
selected using the memory out mux. For example, if the CPU sends an arithmetic
instruction, the control sequence from the arithmetic block would be selected to be
sent to the memristive memory.
Fig. 8.9 Detailed block diagram of the mMPU controller, where an arithmetic block is added to
support computation within the memristive memory [18] (WL = wordlines, BL = bitlines)
Whereas reads and writes in the mMPU are performed in a conventional way
[18], by applying voltages across the target memristor through its wordline and bitline, executing logical
instructions is more complicated since they require a sequence of logical steps. The
arithmetic block is a sophisticated finite state machine, the role of which is to efficiently
break the instruction down into a series of MAGIC operations, and to select
the memristive cells to perform the operations within the memory array. For example,
suppose the CPU sends an instruction to add two numbers (i.e., ADD) within the memory.
The instruction is received by the CPU-in block, which identifies the instruction as
ADD and generates the memory out mux select signal. Then, the instruction is sent
to the arithmetic block, where an appropriate, pre-synthesized execution sequence
is selected for this instruction. This execution sequence is then executed on the
memristive memory. The mMPU controller pipelines this operational sequence to
the memory, changing the applied voltages on each memory clock cycle. Efficient
pipelining maximizes the processing efficiency in terms of speed and energy. To optimize
the throughput of the arithmetic instruction execution, different considerations
should be taken into account [19], as detailed below.
To enable efficient data processing using the mMPU, novel algorithms (e.g., algo-
rithms based solely on MAGIC NOR operations) need to be developed. Exploit-
ing the parallelism offered by the mMPU as described in Sect. 8.3 is essential to
optimize these algorithms in terms of energy, performance, and area. For exam-
ple, multiplying K-binary matrices, each of which is of size M × N , requires
5NK − 5K + 2M + 1 steps when optimizing the algorithm for MAGIC NOR exe-
cution within the mMPU [18]. This algorithm has a quadratic time complexity of
O(NK), while in standard von Neumann architecture, a cubic time complexity of
O(NKM ) is required. This instance exemplifies the potential performance benefits
of processing data within the memory. Hence, design of a correct algorithm is the
key for efficient processing using the mMPU.
Fig. 8.10 a Static processing area, where a portion of the memory space is dedicated for processing
(in blue), b dynamic processing area, where a portion of memory space, variable in location and size,
is allocated for processing or storage (in blue, purple, and orange), and allocation of processing (P)
and storage (S) areas with respect to time. The tables next to the figures denote the time multiplexing
of processing and storage space for both the schemes. Symbols S and P mean storage and processing,
respectively
Logic execution within the mMPU requires utilization of memory cells for computa-
tion. This utilization must maintain the integrity of the data stored in the memristive
memory. For example, while calculating complex Boolean functions, several MAGIC
NOR/NOT operations must be performed, and the intermediate values of these oper-
ations are also stored within the memristors, which we call functional memristors
[19, 39]. The functional memristors must be separated from the memristors where
valid data is stored, and the Operating System (OS) has to make sure that no data is
destroyed. One straightforward solution to this problem is to allocate a fixed amount
of memory space for processing; this is known as the static processing area [18]
as shown in Fig. 8.10a. A more complicated solution is to dynamically allocate the
processing area based on the availability of the memory cells and required amount
of functional memory space for processing; this is known as the dynamic processing
area, as shown in Fig. 8.10b.
Figure 8.10 shows the difference between static and dynamic processing areas.
It also shows how the dynamic technique time multiplexes the different portions of
the available memory for processing and storage, while the static technique uses
the dedicated areas for processing and storage. While the dynamic processing area
scheme efficiently allocates the memory space without any wastage, it requires a
costly memory management. In contrast, the static processing area scheme does not
require any memory management since the area is committed at design time, but it
suffers from lower memory utilization.
Fig. 8.11 The desired logic function is synthesized using ABC [32] for NOR and NOT gates
and then optimized specifically for MAGIC within memory, generating a general mapping and a
sequence of operations. The general execution is mapped to specific cells in real time, based on the
temporary state of the mMPU and its available cells [20]
The state machine of the mMPU controller is designed to execute the sequence of
required NOR and NOT operations within the mMPU. Wisely exploiting the parallelism
capabilities described in Sect. 8.3 to execute numerous NOR operations simultaneously
on different rows or columns may significantly improve the computation performance.
To maximize the efficiency of the computations performed by the mMPU,
the controller has to be designed to perform a NOR and NOT sequence
that is optimized in terms of latency, energy, area, or a combination of the
three. The optimized sequence is determined automatically using SIMPLE MAGIC
[20], a tool we recently developed.
SIMPLE receives any logic function, and performs the following flow, as illus-
trated in Fig. 8.11:
1. The function is converted into a netlist of NOR and NOT gates using a modified
ABC synthesis tool [32].
2. The netlist is mapped into a memristive memory, by solving an optimization
problem, using the z3 SMT solver [8]. Thus, for every gate j, the variables of
the problem are
• The wordline (row R) and bitline (column C) coordinates of the inputs Aj, Bj and output Ej of the
gate:
$$\{R_{A_j}, C_{A_j}\},\ \{R_{B_j}, C_{B_j}\},\ \{R_{E_j}, C_{E_j}\}.$$
• Inputs and outputs of each MAGIC NOR gate have to be mapped to the same
column or the same row (as described in Sect. 8.3.2):
$$\forall\ \text{gate } j:\ [(C_{A_j}\neq C_{B_j}\neq C_{E_j})\cap(R_{A_j}=R_{B_j}=R_{E_j})]\cup[(C_{A_j}=C_{B_j}=C_{E_j})\cap(R_{A_j}\neq R_{B_j}\neq R_{E_j})]. \tag{8.10}$$
• To perform several MAGIC gates in parallel, the inputs and outputs have to
be aligned (as shown in Fig. 8.7):
$$\begin{aligned}
\forall\ \text{gates } j,k:\ & T_j\neq T_k\ \cup\\
& \{\{[(C_{A_j}=C_{A_k}\cap C_{B_j}=C_{B_k})\cup(C_{A_j}=C_{B_k}\cap C_{B_j}=C_{A_k})]\cap(C_{E_j}=C_{E_k})\}\cap(R_{A_j}=R_{B_j}=R_{E_j}\cap R_{A_k}=R_{B_k}=R_{E_k})\}\ \cup\\
& \{\{[(R_{A_j}=R_{A_k}\cap R_{B_j}=R_{B_k})\cup(R_{A_j}=R_{B_k}\cap R_{B_j}=R_{A_k})]\cap(R_{E_j}=R_{E_k})\}\cap(C_{A_j}=C_{B_j}=C_{E_j}\cap C_{A_k}=C_{B_k}=C_{E_k})\}.
\end{aligned} \tag{8.11}$$
• A MAGIC gate can be executed only when its inputs were produced previously
and each input has to be located in the same memory cell as the output of the
gate connected to it.
The optimization problem can be solved for minimizing the latency, area, energy,
or a combination of them. For example, the optimization function for minimizing
latency is
3. The mapping is reshuffled in real time, according to the occupancy of the memory
at the moment the computation is done.
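To make the flow above more concrete, the following sketch executes a small NOR/NOT netlist on a single memory row. It is a naive baseline written for illustration, not the SMT-based optimization performed by SIMPLE: every signal simply gets its own column, gates run one per cycle, and initialization cycles are not counted. The netlist (an XOR built from NOR gates), the signal names, and the helper functions `map_to_row` and `execute` are all assumptions made for this example.

```python
# Each gate is (output_name, [input_names]); a NOT gate is a one-input NOR.
# The netlist is assumed to be listed in topological order.
XOR_NETLIST = [
    ("nor_ab", ["a", "b"]),
    ("not_a", ["a"]),
    ("not_b", ["b"]),
    ("and_ab", ["not_a", "not_b"]),   # NOR of the inverted inputs is AND
    ("xor", ["nor_ab", "and_ab"]),
]


def map_to_row(netlist, input_names):
    """Naive mapping: every signal gets its own column in a single row."""
    columns = {name: c for c, name in enumerate(input_names)}
    for out, _ in netlist:
        columns[out] = len(columns)
    return columns


def execute(netlist, columns, input_values):
    """Run the gates one per cycle on a single memory row (a list of bits)."""
    row = [0] * len(columns)
    for name, bit in input_values.items():
        row[columns[name]] = bit
    cycles = 0
    for out, ins in netlist:
        # MAGIC NOR: the output is logic 1 only if every input cell is logic 0.
        row[columns[out]] = 0 if any(row[columns[i]] for i in ins) else 1
        cycles += 1
    return row, cycles


columns = map_to_row(XOR_NETLIST, ["a", "b"])
for a in (0, 1):
    for b in (0, 1):
        row, cycles = execute(XOR_NETLIST, columns, {"a": a, "b": b})
        print(a, b, row[columns["xor"]], cycles)
```

An optimizing mapper would instead place gates so that aligned gates on different rows or columns share a cycle, which is where the latency savings come from.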
Automation of the process promises optimal results and reduces the time required
to design the mMPU controller. The first two steps are performed to design the
[Fig. 8.12 axes: benchmarks 5xp1, clip, cm150a, cm162a, cm163a, misex1, parity, and x2 on the horizontal axis; vertical axis ranging from 0 to 250]
Fig. 8.12 Performance comparison of SIMPLE [20] (dark green) with other synthesis approaches,
which include Chakraborti et al. [5] (green), the original netlist without synthesis (blue), and the
netlist synthesized with ABC (yellow)
state machine of the arithmetic block of the mMPU controller, and the third step is
performed by the mMPU controller during run time. Figure 8.12 shows that SIMPLE
achieves an average speedup of 1.9× over a NOT and NOR
netlist prior to optimization with SIMPLE (and before synthesizing the netlist with
ABC). Additionally, SIMPLE yields a speedup of 1.94× compared to
previous work [5]. Two major factors contribute to the performance benefit of SIMPLE.
SIMPLE tries to exploit the intrinsic parallelism offered by MAGIC NOR
execution within the memristive memory. Furthermore, while exploiting this paral-
lelism, SIMPLE rearranges the netlist in such a way that the copy operations of data
within the array are not required between the successive steps of execution. Current
and future improvements of SIMPLE may further increase performance.
[Fig. 8.13 panel labels: CPU, GPU, DMA, DSP, crypto and other accelerators, system bus, DRAM (optional), mMPU, R/W data, compute CMD, arguments and results]
Fig. 8.13 Illustration of the possible mMPU usage models. a When using the mMPU as an accelerator,
data to be processed is copied from the main memory to the mMPU and computing commands
are sent from the CPU. b When using the mMPU as a part of the main memory, the data meant
for processing is stored beforehand in the mMPU address space, allowing the commencement of
processing with a single command from the CPU
Combined with careful data allocation, this usage model may avoid most of the data
transfers and further speed up computation. This enhancement, however, comes at
the cost of more complicated control (discussed later in this section), and with the
need to reserve parts of the available memory space (otherwise used to store data)
for intermediate results of the computation.
Data coherency also must be addressed. Using the mMPU allows data to be
modified in its location within the main memory and without modifying any instances
of the same data down the memory hierarchy (i.e., in caches). Therefore, maintaining
data coherency requires an added capability to invalidate data in caches if the data
was changed by the mMPU. When the mMPU is used as an accelerator, data that is
processed needs to be locked against changes (by using an atomic operation or some
other means) to avoid it being changed while the mMPU is processing. The concepts
of data redundancy and memory reliability also need to be addressed in order for
a system containing the mMPU to be seamlessly compatible with existing SW and
data correction mechanisms.
A programming model must be suited for each usage model for efficient utilization
of the mMPU. Because the rest of the system should be as oblivious to the mMPU as
possible, standard interfaces should be adopted, and the mMPU should be designed
so that minimal changes to the rest of the system are required. Furthermore, apart
from using mMPU for data processing, it can also be selectively used as the system
memory, making it compatible with the von Neumann computing model. Rather than
being burdened with challenging optimization tasks as in the case of conventional
architectures, for the general use case, the programmer only has to determine the
desired operation, the addresses of the inputs and outputs, and the size of the inputs.
Such an accelerator is addressed with software support, i.e., additional libraries with
specific functions that the mMPU will support, much as CUDA [44] does for NVIDIA
GPUs. In this case, the CPU will offload the code to the mMPU directly without the
need to modify the ISA or the current conventional systems.
Fig. 8.14 Examples of the structure of an mMPU instruction. (a) In a conventional memory access
instruction, the instruction is composed of a direction (read/write) bit, an address field, and a data
field. (b) An instruction for in-memory computing is always in the write direction, and written to
an address which is reserved by the controller for computing instructions. The rest of the bits are
used to transmit any information needed for the execution of the command and may specify the
operation to be carried out, the input/output location, size, etc.
Two approaches are proposed for utilizing the mMPU as a memory capable of
computing. The first requires extending the ISA with additional commands that the
mMPU supports. These commands will be successively dispatched by the CPU to the
mMPU so that computation tasks are performed on specified locations in the memory
(i.e., addresses). In the second approach, the mMPU will have a reserved address,
which when written to will initiate the equivalent command. Thus, an instruction
for in-memory computing contains a write operation to a reserved address that is
mapped to a dedicated register within the mMPU controller. The instruction must
contain all the relevant information for execution, such as the required operation,
operands and result location, and size. An example of such an instruction is shown
in Fig. 8.14.
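The second approach can be sketched in a few lines of Python. This is only an illustration of the idea of a write to a reserved address whose data word carries the command; the field widths, the opcode values, the reserved address, and the names `ComputeCommand` and `as_memory_write` are all assumptions introduced here and are not taken from the mMPU design.

```python
from dataclasses import dataclass

# Hypothetical layout: every width, opcode value, and address below is an
# assumption made for illustration, not the mMPU's actual encoding.
RESERVED_CMD_ADDR = 0xFFFF_F000      # assumed address of the controller's command register
OPCODES = {"NOR": 0x1, "ADD": 0x2}   # assumed opcode table


@dataclass(frozen=True)
class ComputeCommand:
    opcode: int   # 8 bits: operation to carry out
    src_a: int    # 16 bits: location of the first operand
    src_b: int    # 16 bits: location of the second operand
    dst: int      # 16 bits: location of the result
    size: int     # 8 bits: operand size

    def pack(self) -> int:
        """Pack the fields, most significant first, into one 64-bit data word."""
        fields = ((self.opcode, 8), (self.src_a, 16), (self.src_b, 16),
                  (self.dst, 16), (self.size, 8))
        word = 0
        for value, width in fields:
            if not 0 <= value < (1 << width):
                raise ValueError(f"field value {value} does not fit in {width} bits")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "ComputeCommand":
        """Recover the fields from a packed 64-bit data word."""
        size = word & 0xFF
        dst = (word >> 8) & 0xFFFF
        src_b = (word >> 24) & 0xFFFF
        src_a = (word >> 40) & 0xFFFF
        opcode = (word >> 56) & 0xFF
        return cls(opcode, src_a, src_b, dst, size)


def as_memory_write(cmd: ComputeCommand):
    """Express the command as an ordinary memory write: (direction, address, data)."""
    return ("W", RESERVED_CMD_ADDR, cmd.pack())
```

From the CPU's point of view, the result of `as_memory_write` is indistinguishable from a normal store, which is why this approach does not require an ISA extension; only the controller needs to decode the data word when the reserved address is hit.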
8.6 Conclusions
Data transfer between processing and memory units is the major performance and
energy-efficiency bottleneck of modern computing systems, commonly known as the
von Neumann bottleneck. Whereas prior art has tried to reduce the distance between
processing and memory units to solve this problem, we propose the mMPU, an
entirely different solution that can tackle the von Neumann bottleneck even more
efficiently. In the mMPU, we rely on employing memristive memory cells directly
for processing, which largely eliminates the necessity for data transfer. We also
present MAGIC, a technique to execute logical operations within the memristive
memory crossbar without any modification of the memory structure. We further
show how to extend execution of a single MAGIC gate to a parallel execution of
several MAGIC gates within the memory crossbar. We present our recent works
on the mMPU microarchitecture design, which includes the mMPU controller and
an automatic logic synthesis tool. Finally, we describe implications of the system
integration of the mMPU while using it in two different ways, i.e., an accelerator
mode and in a main memory mode. Applications that will benefit the most from this
new architecture include deep learning, image processing, DNA sequencing, and
matrix multiplication, which have a high degree of intrinsic parallelism and large
amounts of data.
References
1. R. Balasubramonian, B. Grot, Near-data processing. IEEE Micro 36(1), 4–5 (2016). https://doi.org/10.1109/MM.2016.1
2. B. Black, Die Stacking is Happening! Proceedings of the International Symposium on Microarchitecture (2013)
3. M.N. Bojnordi, E. Ipek, Memristive Boltzmann machine: a hardware accelerator for combinatorial optimization and deep learning. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016), pp. 1–13. https://doi.org/10.1109/HPCA.2016.7446049
4. Y. Cassuto, S. Kvatinsky, E. Yaakobi, Sneak-path constraints in memristor crossbar arrays. In: Proceedings of the IEEE International Symposium on Information Theory (ISIT) (2013), pp. 156–160
5. S. Chakraborti, P.V. Chowdhary, K. Datta, I. Sengupta, Bdd based synthesis of boolean functions using memristors. In: 2014 9th International Design and Test Symposium (IDT) (2014), pp. 136–141. https://doi.org/10.1109/IDT.2014.7038601
6. Y.C. Chen et al., An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel threshold switching, self-rectifying chalcogenide device. In: IEEE International on Electron Devices Meeting IEDM ’03 Technical Digest (2003), pp. 37.4.1–37.4.4
7. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (2016), pp. 27–39. https://doi.org/10.1109/ISCA.2016.13
8. L. De Moura, N. Bjørner, Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems (2008), pp. 337–340
9. P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, H. Noyes, An efficient and scalable semiconductor architecture for parallel automata processing. IEEE Trans. Parallel Distrib. Syst. 25(12), 3088–3098 (2014). https://doi.org/10.1109/TPDS.2014.8
10. Y. Eckert, N. Jayasena, G.H. Loh, Thermal feasibility of die-stacked processing in memory. In: Proceedings of the 2nd Workshop Near-Data Processing (2014)
11. D.G. Elliott, M. Stumm, W.M. Snelgrove, C. Cojocaru, R. Mckenzie, Computational RAM: implementing processors in memory. IEEE Des. Test Comput. 16(1), 32–41 (1999). https://doi.org/10.1109/54.748803
12. M. Gokhale, B. Holmes, K. Iobst, Processing in memory: the Terasys massively parallel PIM array. Computer 28(4), 23–31 (1995). https://doi.org/10.1109/2.375174
13. L. Guckert, E.E. Swartzlander, MAD gates: Memristor logic design using driver circuitry. IEEE Trans. Circuits Syst. II Exp. Briefs 64(2), 171–175 (2017). https://doi.org/10.1109/TCSII.2016.2551554
14. Q. Guo, X. Guo, Y. Bai, E. Ipek, A resistive TCAM accelerator for data-intensive computing. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM (2011), pp. 339–350
15. Q. Guo, X. Guo, R. Patel, E. Ipek, E.G. Friedman, AC-DIMM: associative computing with STT-MRAM. ACM SIGARCH Comput. Arch. News 41(3), 189–200 (2013)
16. HSA Foundation: Harmonizing the Industry Around Heterogeneous Computing, http://www.hsafoundation.com/
17. J.J. Huang, Y.M. Tseng, W.C. Luo, C.W. Hsu, T.H. Hou, One selector one resistor (1s1r) crossbar array for high-density flexible memory applications. IEEE (2011), pp. 31.7.1–31.7.4
18. R.B. Hur, S. Kvatinsky, Memristive memory processing unit (MPU) controller for in-memory processing. In: 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE) (2016), pp. 1–5. https://doi.org/10.1109/ICSEE.2016.7806045
19. R.B. Hur, N. Talati, S. Kvatinsky, Algorithmic considerations in memristive memory processing units (MPU). In: CNNA 2016 15th International Workshop on Cellular Nanoscale Networks and their Applications (2016), pp. 1–2
20. R.B. Hur, N. Wald, N. Talati, S. Kvatinsky, SIMPLE MAGIC: synthesis and in-memory MaPping of logic execution for memristor-aided loGIC. In: Proceeding of the IEEE International Conference on Circuits Aided Design (2017)
21. Hybrid Memory Cube Consortium, Hybrid Memory Cube Specification 1.0 (2013)
22. JEDEC Solid State Technology Association: High Bandwidth Memory (HBM) DRAM, http://www.jedec.org/standards-documents/results/jesd235
23. S. Kvatinsky, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, Memristor-based material implication (imply) logic: design principles and methodologies. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22(10), 2054–2066 (2014). https://doi.org/10.1109/TVLSI.2013.2282132
24. S. Kvatinsky, N. Wald, G. Satat, A. Kolodny, U.C. Weiser, E.G. Friedman, MRL–memristor ratioed logic. In: 2012 13th International Workshop on Cellular Nanoscale Networks and their Applications (2012), pp. 1–6. https://doi.org/10.1109/CNNA.2012.6331426
25. S. Kvatinsky, E.G. Friedman, A. Kolodny, U.C. Weiser, The desired memristor for circuit designers. IEEE Circuits Syst. Mag. 13(2), 17–22 (2013). https://doi.org/10.1109/MCAS.2013.2256257
26. S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, MAGIC - memristor-aided logic. IEEE Trans. Circuits Syst. II Express Briefs 61(11), 895–899 (2014). https://doi.org/10.1109/TCSII.2014.2357292
27. J. Lee, M. Jo, D. Jun Seong, J. Shin, H. Hwang, Materials and process aspect of cross-point RRAM (invited). Microelectron. Eng. 88(7), 1113–1118 (2011)
28. Y. Levy, J. Bruck, Y. Cassuto, E.G. Friedman, A. Kolodny, E. Yaakobi, S. Kvatinsky, Logic operations in memory using a memristive akers array. Microelectron. J. 45(11), 1429–1437 (2014)
29. H. Li et al., Write disturb analyses on half-selected cells of cross-point rram arrays. In: Proceedings of the IEEE International Reliability Physics Symposium (2014), pp. MY.3.1–MY.3.4
30. S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In: Design Automation Conference (DAC) (2016), pp. 1–6. https://doi.org/10.1145/2897937.2898064
31. W. Lynch, Worst-case analysis of a resistor memory matrix. IEEE Trans. Comput. C–18(10), 940–942 (1969)
32. A. Mishchenko, ABC: a system for sequential synthesis and verification (2012), http://www.eecs.berkeley.edu/~alanmi/abc/
33. M. Oskin, F.T. Chong, T. Sherwood, Active pages: a computation model for intelligent memory. SIGARCH Comput. Archit. News 26(3), 192–203 (1998). https://doi.org/10.1145/279361.279387
34. G. Papandroulidakis, I. Vourkas, N. Vasileiadis, G.C. Sirakoulis, Boolean logic operations and computing circuits based on memristors. IEEE Trans. Circuits Syst. II Exp. Briefs 61(12), 972–976 (2014). https://doi.org/10.1109/TCSII.2014.2357351
35. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, A Case for Intelligent RAM. IEEE Micro 17(2), 34–44 (1997). https://doi.org/10.1109/40.592312
36. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, Intelligent RAM (IRAM): chips that remember and compute. In: 1997 IEEE International Solids-State Circuits Conference. Digest of Technical Papers (1997), pp. 224–225. https://doi.org/10.1109/ISSCC.1997.585348
37. J. Reuben, R. Ben-Hur, N. Wald, N. Talati, A.H. Ali, P.E. Gaillardon, S. Kvatinsky, Memristive logic: a framework for evaluation and comparison. In: International Symposium on Power and Timing Modeling, Optimization, and Simulation (PATMOS) (2017) (in press)
38. S. Shin, K. Kim, S.M. Kang, Analysis of passive memristive devices array: data-dependent statistical model and self-adaptable sense resistance for RRAMs. Proc. IEEE 100(6), 2021–2032 (2012)
39. N. Talati, S. Gupta, P. Mane, S. Kvatinsky, Logic design within memristive memories using memristor-aided loGIC (MAGIC). IEEE Trans. Nanotechnol. 15(4), 635–650 (2016). https://doi.org/10.1109/TNANO.2016.2570248
40. K. Wang, Y. Qi, J.J. Fox, M.R. Stan, K. Skadron, Association rule mining with the micron automata processor. In: 2015 IEEE International Parallel and Distributed Processing Symposium (2015), pp. 689–699. https://doi.org/10.1109/IPDPS.2015.101
41. H.S.P. Wong, H.Y. Lee, S. Yu, Y.S. Chen, Y. Wu, P.S. Chen, B. Lee, F.T. Chen, M.J. Tsai, Metal oxide RRAM. Proc. IEEE 100(6), 1951–1970 (2012). https://doi.org/10.1109/JPROC.2012.2190369
42. W. Woods, M.M.A. Taha, S.J.D. Tran, J. Bürger, C. Teuscher, Memristor panic: a survey of different device models in crossbar architectures. In: Proceedings of the 2015 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH15) (2015), pp. 106–111. https://doi.org/10.1109/NANOARCH.2015.7180595
43. L. Xie, H.A.D. Nguyen, M. Taouil, S. Hamdioui, K. Bertels, Fast boolean logic mapped on memristor crossbar. In: International Conference on Computer Design (2015), pp. 335–342. https://doi.org/10.1109/ICCD.2015.7357122
44. C.T. Yang, C.L. Huang, C.F. Lin, Hybrid cuda, openmp, and mpi parallel programming on multicore gpu clusters. Comput. Phys. Commun. 182(1), 266–269 (2011)
45. L. Yavits, S. Kvatinsky, A. Morad, R. Ginosar, Resistive associative processor. IEEE Comput. Arch. Lett. 14(2), 148–151 (2015). https://doi.org/10.1109/LCA.2014.2374597
46. Y. Zha, J. Li, Reconfigurable in-memory computing with resistive memory crossbar. In: International Conference on Computer-Aided Design (2016), pp. 1–8. https://doi.org/10.1145/2966986.2967069
47. M.A. Zidan, H.A.H. Fahmy, M.M. Hussain, K.N. Salama, Memristor-based memory: the sneak paths problem and solutions. Microelectron. J. 44(2), 176–183 (2013)
Chapter 9
Spintronic Logic-in-Memory Paradigms
and Implementations
Abstract In the current big data era, the limited data transfer bandwidth between the
processor and the memory (the memory wall) and the rising energy consumption
associated with data transfer (the power wall) have become the most urgent problems
for the conventional von Neumann architecture, owing to the physical separation of
the processor and the memory units (see Fig. 9.1a) and the performance mismatch
between the two.
9.1 Introduction
In the current big data era, the limited data transfer bandwidth between the
processor and the memory (the memory wall) and the rising energy consumption
associated with data transfer (the power wall) have become the most urgent problems
for the conventional von Neumann architecture, owing to the physical separation
of the processor and the memory units (see Fig. 9.1a) and the performance mismatch
between the two [1–3]. On one hand, workloads such as big data analytics, artificial
intelligence, and bioinformatics are growing exponentially with time; they generally
operate on large data sets, leading to frequent accesses to the off-chip memory.
On the other hand, moving data may be much more expensive than the computation
itself, e.g., a DRAM access needs 200 times more energy than a floating-point
operation [4, 5]. Increasing the available bandwidth, by increasing either the number
or the frequency of the channels, is a robust solution to the communication
bottleneck, but it significantly increases the cost and is not scalable [6]. Recent
hardware/architecture design paradigms have moved towards greater specialization,
and specialized units for memory-centric computing are vital to future solutions
[7, 8]. The logic-in-memory (LIM) paradigm attempts to embed computation capability
into the memory and to unify data storage and processing in the same die/chip, and
thus shows great promise for breaking the communication bottleneck of the
conventional von Neumann architecture [9–17].
The basic concept of LIM can be traced back to the 1970s [9]. The initial idea was to
add some logic units close to the memory chips through a planar (see Fig. 9.1b) or 3D
(see Fig. 9.1c) structure, to perform operations that are simple yet bandwidth-intensive
and/or latency-sensitive [6]. Strictly speaking, the initial LIM concept is more like
logic-near-memory (LNM): it reduces the data transfer distance, or adds more memory
close to the processor, but does not decrease the number of memory accesses [18–21].
The premise is that, being close to the memory chips, the LNM module has much
lower latency and higher bandwidth to the memory than the processor has, thus
reducing the off-chip memory bandwidth requirements as well as improving the system
performance and energy efficiency [7]. Many prior and recent works have proposed
various approaches. Based on the degree of integration between logic and memory,
they can be classified into two broad categories, i.e., LNM and LIM. Note that
several similar terms are also used in different communities, such as in-memory
computing, computing-in-memory, near-memory-computing, in-memory-processing,
processing-near-memory, and processing-in-memory. Although both LNM and LIM can
alleviate the communication bottleneck, the latter fundamentally departs from the
conventional von Neumann architecture and brings the benefit of reducing the number
of memory accesses (see Fig. 9.1d).
In this chapter, we focus on the LIM paradigm. To date, LIM research has featured
a rich design space, spanning device, circuit, and architecture innovations. However,
these promising studies have not yielded practical prototypes, owing to the
incompatibility of the state-of-the-art logic and memory technologies in terms of
design complexity and fabrication cost [7]. The emergence of 3D integration technology
and nonvolatile memories (NVMs) provides alternative possibilities for effectively and
efficiently implementing LIM hardware [2, 5, 7, 8, 13–15, 21–25]. On one hand,
the 3D-stacking capability of NVM devices allows the logic and memory circuits to be
decoupled into different manufacturing processes by using the back-end-of-line
(BEOL) process, thereby reducing fabrication complexity and cost. On the other hand,
the resistance-based storage mechanisms of NVM devices provide inherent logic
functions, making it possible to embed energy-efficient logic computing capability
within the memory [5].
Recently, many studies have demonstrated that NVMs, such as resistive random-
access memory (ReRAM) [23, 24], magnetic RAM (MRAM) [25, 26], and phase
change memory (PCM) [5, 27], are capable of performing logic operations beyond
data storage. The nonvolatile (NV) devices act as both logic and memory units in the
same die, thus promising a radical renovation of the relationship between computation
and memory. The NVM-based LIM architectures exploit either the peripheral circuitry
(e.g., sensing circuits) or the memory cells already existing inside the memory die
(with no or minimal changes), rather than adding new logic units to the memory chip,
to perform computing tasks. For example, ReRAM can perform matrix-vector
multiplication efficiently in a crossbar structure, and it has been widely studied to
represent multistate synapses in neural computation [24, 28]. On the other hand,
Boolean logic operations in ReRAM, MRAM, and PCM have also been widely studied,
e.g., through
Fig. 9.1 Possible evolution of the computing architecture: (a) conventional von Neumann
architecture with a separated processor (central processing unit, CPU) and memory; (b, c) the
logic-near-memory (LNM) architecture with planar and 3D implementations, obtained by adding a
small amount of logic units close to the memory or by adding more memory close to the processor;
(d) the logic-in-memory (LIM) architecture, which attempts to embed computation capability into
the memory and to realize the unity of data storage and processing at the smallest grain in the
same die [8]
LIM exploits the 3D integration capability of the spintronic devices (mainly
magnetic tunnel junctions, MTJs [31, 32]) to reduce the global routing and the data
transfer distance between the memory and the logic units, as shown in Fig. 9.2. More
importantly, by embedding the nonvolatile spintronic devices, temporarily unused
blocks can be completely powered off during the idle state while maintaining their
data, thus saving standby power. Moreover, data can be recovered instantaneously;
this approach is therefore suitable for “instant-on” and “normally-off” systems.
In addition, the area can be largely reduced, since the spintronic devices are
fabricated on top of the CMOS circuits and do not occupy extra area [33].
Fig. 9.2 3D spintronic/CMOS LIM architecture which integrates non-volatility into the logic
circuits
Figure 9.3a illustrates the schematic of the hybrid spintronic/CMOS-based LIM
architecture [34, 35], which is composed of three main parts: (a) a current-mode
sense amplifier to detect the currents of the two branches and then evaluate
the logic output result; (b) a writing block to program the data stored in the spintronic
memory cells; (c) a CMOS logic network (LN) that performs the logic computation.
The LN contains MTJ devices for nonvolatile inputs and a CMOS logic tree for volatile
inputs, in order to remain area- and power-efficient. In this case, the volatile
logic data can be driven at a high processing frequency, in contrast to the nonvolatile
data stored in the spintronic memory cells, which should be changed at a relatively
low frequency, i.e., they are quasi-constant for computing. The CMOS transistors
and MTJs are the main components of the LN, as shown in Fig. 9.3b [33, 36].
• The CMOS transistor is used as a variable resistor, whose resistance is controlled by
an external volatile input voltage ($X$) applied to the gate (G) terminal. If $X$ = ‘1’,
the CMOS transistor conducts with a low resistance ($R_{ON} \sim$ kΩ). Otherwise,
the CMOS transistor is blocked and has a high resistance ($R_{OFF} \sim$ GΩ).
• The MTJ device is used not only as a storage element but also as a logic input
operand. It has a low resistance ($R_P$) and stores logic ‘1’ ($Y$ = ‘1’) when it
is in the parallel state; otherwise, if the MTJ is in the antiparallel state, its resistance
becomes high ($R_{AP}$) and it stores logic ‘0’ ($Y$ = ‘0’). The difference between
the two resistances depends on the tunneling magnetoresistance (TMR) ratio.
Fig. 9.3 (a) Schematic of the hybrid spintronic/CMOS-based LIM architecture; (b) components in
the logic network (LN) [36]
The reading current ($I_L$ or $I_R$) is inversely proportional to the total resistance
($R_L$ or $R_R$) of the left or the right branch in the LN. Two complementary outputs
$z$ and $\bar{z}$, corresponding to the two opposite logic values, are determined by the
reading currents, providing differential logic operations. If the current of the left branch is
larger than that of the right branch ($I_L > I_R$), the output results on nodes $z$ and
$\bar{z}$ are ‘1’ and ‘0’, respectively; otherwise, if $I_L < I_R$, the corresponding
output results are $z$ = ‘0’ and $\bar{z}$ = ‘1’. By configuring the LN, different nonvolatile
logic functions can be realized, such as OR/NOR, AND/NAND, XOR/NXOR, look-up
table, flip-flop, and full adder. More details can be found in [37–42]. Figure 9.4 shows
the LN configurations for different logic operations that are proposed and analyzed in
[40]. Figure 9.5 shows an example of a 1-bit full adder [41] based on the above
spintronic LIM paradigm. The CMOS logic tree of the full adder is designed according to
(9.1)–(9.4), where A (/A: the complement of A) and Ci (/Ci: the complement of Ci)
are the volatile input operands, while B (/B) relates to the nonvolatile input operand
stored in the MTJs [33].
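As a behavioral reference for the full adder, the following minimal Python sketch computes the four differential outputs (SUM, /SUM, Co, /Co) from the two volatile inputs A and Ci and the nonvolatile input B. It uses the textbook Boolean forms of a full adder, which are an assumption here and are not necessarily written the same way as (9.1)–(9.4); the function name `full_adder` is hypothetical, and the table is transcribed from the truth table of Fig. 9.5.

```python
from itertools import product
from typing import Dict, Tuple

# Truth table transcribed from Fig. 9.5: (A, B, Ci) -> (SUM, Co)
TRUTH_TABLE: Dict[Tuple[int, int, int], Tuple[int, int]] = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (1, 0),
    (0, 1, 0): (1, 0),
    (0, 1, 1): (0, 1),
    (1, 0, 0): (1, 0),
    (1, 0, 1): (0, 1),
    (1, 1, 0): (0, 1),
    (1, 1, 1): (1, 1),
}


def full_adder(a: int, b: int, ci: int) -> Tuple[int, int, int, int]:
    """Return (SUM, /SUM, Co, /Co) for volatile inputs a, ci and nonvolatile input b.

    Textbook forms (assumed): SUM = A xor B xor Ci, Co = AB + ACi + BCi.
    The complements model the differential outputs of the sense amplifier.
    """
    s = a ^ b ^ ci
    co = (a & b) | (a & ci) | (b & ci)
    return s, 1 - s, co, 1 - co


if __name__ == "__main__":
    for a, b, ci in product((0, 1), repeat=3):
        s, s_n, co, co_n = full_adder(a, b, ci)
        print(a, b, ci, "->", s, co)
```

Because B is stored in the MTJs, only A and Ci would need to be driven at the processing frequency in the actual circuit; the sketch ignores that timing aspect and models the logic only.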
By integrating the spintronic devices directly into the logic circuits, power sup-
ply can be cut off during the standby mode. Therefore, the hybrid spintronic/CMOS
based LIM architecture could provide a way to realize ultra-low power consump-
tion and high-performance computing capability for the next generation processor.
Moreover, some computing system paradigms, such as brain-inspired computing, are
Fig. 9.4 Structure of the logic network (LN) for nonvolatile (a) AND logic gate, (b) OR logic gate,
(c) XOR logic circuit. “LB” and “RB” represent the left and right branches, respectively [40]
Fig. 9.5 Full schematic and truth table of the 1-bit full adder based on the hybrid spintronic/CMOS
LIM architecture [41]
Truth table of the 1-bit full adder:
A  B  Ci  SUM  Co
0  0  0   0    0
0  0  1   1    0
0  1  0   1    0
0  1  1   0    1
1  0  0   1    0
1  0  1   0    1
1  1  0   0    1
1  1  1   1    1
mainly caused by the device mismatch (both CMOS and nonvolatile devices) of the
sensing circuit. Unlike in memory chips, where complex error correction circuits
(ECCs) can be employed, it is difficult to embed ECCs in the logic circuits while
keeping high speed, high power efficiency, and low area simultaneously. Therefore,
alternative high-reliability solutions are needed for this approach. Current
research efforts on this topic include fast-access and high-TMR MTJ development,
high-performance sensing circuit design, and low-cost and reliable integration
processes [2].
In this approach, the core memory cell array is exactly the same as a standard memory,
thus the storage density and energy efficiency of the regular read and write operations
can be maintained. The basic concept to perform logic computation is to exploit
the peripheral circuitry (e.g., read circuit) for performing a range of bulk bit-wise
arithmetic operations [44–48]. Figure 9.6 shows the circuit schematic of a typical
spin transfer torque magnetic random-access memory (STT-MRAM) bank and the
related 1T1MTJ bit-cell and the peripheral sense circuit. Here 1T1MTJ refers to one
CMOS transistor connected with one MTJ device in series. The STT-MRAM bank
is generally organized with an array of 1T1MTJ bit-cells via a number of bit-lines
(BLs), source-lines (SLs), word-lines (WLs) and peripheral circuits, e.g., write/read
drivers, row/column decoders, and input/output (I/O) interfaces.
Fig. 9.6 Schematic of a STT-MRAM bank and the associated 1T1MTJ bit-cell structure and sense
amplifier [44]
Fig. 9.7 The key concept of different reference selections to perform (a) a memory read and
(b) LIM operations [46]
Different reference thresholds can be chosen to perform memory read and LIM
operations [46]. As shown in Fig. 9.7a, for a memory read operation, an addressed
memory cell is selected by the target BL, WL, and SL, and is embedded in the read
path to generate a data sense voltage ($V_{data}$), which is compared with a
reference voltage $V_{ref}$ through a sense amplifier. Owing to the different states of
the selected bit-cell (parallel or antiparallel state, corresponding to low or high
resistance, $R_P$ or $R_{AP}$), $V_{data}$ could be $V_P$ or $V_{AP}$ ($V_P < V_{AP}$),
respectively. Thus, by setting the reference voltage at $(V_P + V_{AP})/2$, the sense
amplifier outputs a binary bit ‘1’ or bit ‘0’ when $V_{data} > V_{ref}$ or
$V_{data} < V_{ref}$, respectively. For comparison, Fig. 9.7b depicts the
sensing-based LIM operations (with two input operands as an example) using the
peripheral read circuit, where two memory bit-cells are addressed simultaneously.
Owing to the different resistance combinations of the two selected bit-cells, i.e.,
($R_{AP}$, $R_{AP}$), ($R_{AP}$, $R_P$), and ($R_P$, $R_P$), three different data sense
voltages $V_{data}$, denoted as $V_{AP,AP}$, $V_{AP,P}$, and $V_{P,P}$, respectively,
could be generated. If the reference voltage $V_{ref}$ is set to
$(V_{AP,AP} + V_{AP,P})/2$ by tuning the reference resistance $R_{ref}$, the sense
amplifier outputs binary ‘1’ only when both selected bit-cells are in antiparallel
states, i.e., $V_{data} > V_{ref}$. Thus, this sensing operation with a modified
reference voltage performs an AND/NAND logic operation, taking the binary data stored
in the two bit-cells as the two logic input operands. Similarly, when the reference
voltage is shifted to $(V_{P,P} + V_{AP,P})/2$, the OR/NOR logic operation
can be performed. More details can be found in the related papers [45, 46]. An XOR
logic operation can also be realized when the two sensing schemes shown in Fig. 9.7 are
used in conjunction with a CMOS-based NOR logic gate or by modifying the sensing
circuit [45]. Furthermore, a full adder and other more complex logic functions can
be achieved by a combination of the above-described operations [45]. This approach
can be extended to the case with multiple input operands by tuning the corresponding
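The reference-selection idea for read, AND, and OR can be illustrated with a small numerical sketch. The model below is an assumption made for illustration only: the selected cells are treated as resistors in parallel driven by a constant read current, and the values of `R_P`, `R_AP`, and `I_READ` are hypothetical rather than taken from [45, 46]. Following the text, the antiparallel (high-resistance) state encodes ‘1’.

```python
# Hypothetical device and bias values, chosen only to make the ordering visible.
R_P = 3e3       # parallel-state resistance in ohms (assumed)
R_AP = 6e3      # antiparallel-state resistance in ohms (assumed, TMR = 100%)
I_READ = 20e-6  # read current in amperes (assumed)


def cell_resistance(bit: int) -> float:
    """Antiparallel (high resistance) stores '1'; parallel stores '0'."""
    return R_AP if bit else R_P


def sense_voltage(bits) -> float:
    """V_data for the simultaneously selected cells, modeled as parallel resistors."""
    conductance = sum(1.0 / cell_resistance(b) for b in bits)
    return I_READ / conductance


def memory_read(bit: int) -> int:
    """Regular read: V_ref = (V_P + V_AP) / 2."""
    v_ref = (sense_voltage((0,)) + sense_voltage((1,))) / 2
    return int(sense_voltage((bit,)) > v_ref)


def lim_and(a: int, b: int) -> int:
    """AND: V_ref = (V_AP,AP + V_AP,P) / 2, so only (AP, AP) exceeds it."""
    v_ref = (sense_voltage((1, 1)) + sense_voltage((1, 0))) / 2
    return int(sense_voltage((a, b)) > v_ref)


def lim_or(a: int, b: int) -> int:
    """OR: V_ref = (V_P,P + V_AP,P) / 2, so only (P, P) falls below it."""
    v_ref = (sense_voltage((0, 0)) + sense_voltage((1, 0))) / 2
    return int(sense_voltage((a, b)) > v_ref)
```

With these assumed values the three two-cell sense voltages are well separated, which is what lets a single comparator implement different functions just by moving the reference. In a real array the separation shrinks with device variation, so the margins would have to be checked against measured distributions.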
This approach exploits the memory cells for logic operations. The key idea is to
dynamically configure the memory cell states with regular memory-like write
and read operations, depending on the combination of the logic input operands. The
initial data stored in the memory cell acts as one of the input operands, and the logic
output is represented by the final resistance state of the memory cell; it is stored
in situ in the same memory cell through a regular memory-like write operation and
can be read out via the sense amplifier in a regular memory-like manner [49,
50]. Below we take an advanced spintronic memory, based on three-terminal
voltage-gated spin Hall effect (VG-SHE) MTJ devices [51, 52], to
describe the LIM concept and implementation.
Figure 9.8 shows the schematic and switching behavior of the VG-SHE-MTJ
device, which exploits both the SHE [53, 54] and the voltage-controlled magnetic
anisotropy (VCMA) effect [55, 56] for MTJ switching. For the SHE-driven MTJ
switching mechanism, the critical current can be modulated by applying a bias voltage
across the MTJ via the VCMA mechanism.
The key idea for logic computation is to modulate the final resistance state (denoted
as the stateful logic output result) of the MTJ device with two different inputs (i.e.,
the VCMA bias voltage and the SHE write current). Without loss of generality, we
can assume that a high resistance (low resistance) state of the MTJ represents a
logical data ‘1’ (data ‘0’). Furthermore, we can assume that the first input data ($A$) is
denoted by the VCMA bias voltage ($V_b$). Specifically, a positive VCMA bias voltage
(with amplitude $+V_b$ = 600 mV) denotes the logical input “$A$ = 1”, while a zero
VCMA bias voltage denotes the logical input “$A$ = 0”. The second input data ($B$)
is denoted by the initial data value (i.e., resistance) stored in the MTJ device. The
third input ($C$) is denoted by the polarity of the SHE write current ($I_{SHE}$). A positive
SHE write current ($+I_{SHE}$) denotes the logical input “$C$ = 1”, while a negative SHE
write current ($-I_{SHE}$) denotes the logical input “$C$ = 0”. Here, we need
$|I_{C1}| < |I_{SHE}| < |I_{C2}|$, e.g., $|I_{SHE}|$ = 65 µA, for correct logic computation. In this
configuration, if $A$ = 1 ($V_b$ = +600 mV), the critical current for SHE-driven MTJ
magnetization switching is $|I_{C1}|$, so $I_{SHE}$ can switch the MTJ state and the final MTJ
Fig. 9.8 Three-terminal VG-SHE-driven MTJ device. a Device schematic; b voltage-gating
mechanism on the critical current for SHE-driven magnetization switching under different bias
voltages; c, d illustration of the energy barrier and the corresponding magnetization switching
under two different bias voltages; e the critical SHE switching current as a function of the
applied bias voltage across the MTJ device [49]
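The resulting stateful logic function can be summarized as

$$
B_{i+1} = A C + \bar{A} B_i =
\begin{cases}
\bar{A} B_i, & C = 0 \quad (\text{‘AND’}) \\
A + B_i, & C = 1 \quad (\text{‘OR’}) \\
A \bar{B}_i + \bar{A} B_i, & C = \bar{B}_i \quad (\text{‘XOR’})
\end{cases}
$$
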
where $B_i$ and $B_{i+1}$ are the initial input data and the final logical output result
stored in situ in the MTJ, respectively. The input C thus acts as a control signal.
Specifically, if C = 1, then $B_{i+1} = A + B_i$, performing an “OR” logic operation;
otherwise, if C = 0, then $B_{i+1} = \bar{A} B_i$, performing an “AND” logic operation
(on $\bar{A}$ and $B_i$). For the “XOR” logic operation, we first read out $B_i$ from the
MTJ and set $C = \bar{B}_i$; then we get $B_{i+1} = A \bar{B}_i + \bar{A} B_i$, performing an
“XOR” logic operation. Figure 9.9 shows the state transition diagram, truth table
and Karnaugh map. It should be noted that all Boolean logic functions can be
realized by reconfiguring the input signals.
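The switching rule described above can be checked with a small behavioral sketch. It is only an illustration under stated assumptions, not the authors' simulation model: the two critical-current values are made-up numbers chosen to satisfy $|I_{C1}| < |I_{SHE}| < |I_{C2}|$, the names `lim_step` and `lim_xor` are hypothetical, and a positive write current is assumed to set the cell to ‘1’. Under this reading the cell takes the value C when A = 1 and keeps its stored bit when A = 0.

```python
from itertools import product

# Illustrative values only. The text quotes |I_SHE| = 65 uA and requires
# |I_C1| < |I_SHE| < |I_C2|; the two thresholds below are assumed numbers.
I_C1_UA = 50.0   # assumed critical current with V_b = +600 mV (A = 1)
I_C2_UA = 80.0   # assumed critical current with V_b = 0 (A = 0)
I_SHE_UA = 65.0  # write-current amplitude quoted in the text


def lim_step(a: int, b: int, c: int) -> int:
    """Return the bit stored in the cell after one write-like logic step.

    a: VCMA bias input (1 means +V_b applied, 0 means zero bias)
    b: bit currently stored in the MTJ
    c: polarity of the SHE write current (1 positive, 0 negative)
    """
    critical = I_C1_UA if a == 1 else I_C2_UA
    if I_SHE_UA > critical:
        return c  # switching is possible: the current polarity sets the state
    return b      # below threshold: the stored bit is retained


def lim_xor(a: int, b: int) -> int:
    """XOR needs a read first, so that C can be set to NOT b."""
    return lim_step(a, b, 1 - b)


# Exhaustive check of the three operating modes over all (A, B) pairs.
for a, b in product((0, 1), repeat=2):
    assert lim_step(a, b, 1) == (a | b)        # C = 1: OR
    assert lim_step(a, b, 0) == ((1 - a) & b)  # C = 0: AND of NOT A and B
    assert lim_xor(a, b) == (a ^ b)            # C = NOT B: XOR
```

The exhaustive loop passes for all four (A, B) combinations, which is consistent with the single relation $B_{i+1} = A C + \bar{A} B_i$ covering the three modes.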
The logic output $B_{i+1}$ is stored in situ in the MTJ, so one additional memory-like
read operation is needed to read out the logic output. It can be seen that the logic
operations in the VG-SHE-driven MTJ based spintronic memory are very similar
to the regular write/read operations for a memory data access. This LIM approach
can be implemented either in a typical 2T1MTJ cell array or in a crossbar array
structure owing to the shared path of the SHE write current. More details can be
found in [49]. A similar LIM concept can also be extended to STT-MRAM by changing the
bit-cell structure [57]. In this LIM approach, the memory can work in either the
memory mode or the logic mode, as shown in Fig. 9.10, depending on the
application-oriented requirements. This approach is applicable to any resistive memories with
Fig. 9.10 (panel labels only): reconfigurable LIM cores; logic mode; memory mode
Acknowledgements This work was supported by the National Natural Science Foundation
of China (61871008 and 61571023), the National Key Technology Program of China
(2017ZX01032101), and the International Mobility Project (B16001 and 2015DFE12880).
References
1. N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V.
Narayanan, Leakage current: Moore’s law meets static power. Computer 36(12), 68–75 (2003)
2. W. Kang, Y. Zhang, Z. Wang, J. Klein, C. Chappert, D. Ravelosona, G. Wang, Y. Zhang, W.
Zhao, Spintronics: emerging ultra-low power circuits and systems beyond MOS technology.
ACM J. Emerg. Technol. Comput. Syst. 12(2), 1–42 (2015)
3. W.A. Wulf, S.A. McKee, Hitting the memory wall: implications of the obvious. ACM
SIGARCH Comput. Arch. News 23(1), 20–24 (1995)
4. S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel
computing. IEEE Micro 31(5), 7–17 (2011)
5. S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: a processing-in-memory architecture
for bulk bitwise operations in emerging non-volatile memories, in ACM/EDAC/IEEE Design
Automation Conference (2016), pp. 1–6
6. V. Seshadri, O. Mutlu, The Processing Using Memory Paradigm: In-DRAM Bulk Copy,
Initialization, Bitwise AND and OR, arXiv:1610.09603 (2016)
7. Z. Chowdhury, J.D. Harms, S.K. Khatamifard, M. Zabihi, Y. Lv, A.P. Lyle, S. Sapatnekar,
U.R. Karpuzcu, J.-P. Wang, Efficient in-memory processing using spintronics. IEEE Comput.
Archit. Lett. 17(1), 42–46 (2018)
8. M.A. Zidan, J.P. Strachan, W.D. Lu, The future of electronics based on memristive systems.
Nat. Electron. 1(1), 22–29 (2018)
9. H.S. Stone, A logic-in-memory computer. IEEE Trans. Comput. C-19(1), 73–78 (1970)
10. J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-enabled instructions: a low-overhead, locality-aware
processing-in-memory architecture, in 2015 ACM/IEEE 42nd Annual International Symposium
on Computer Architecture (2015), pp. 336–348
11. J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A scalable processing-in-memory accelerator
for parallel graph processing, in 2015 ACM/IEEE 42nd Annual International Symposium on
Computer Architecture (2015), pp. 105–117
12. D.G. Elliott, M. Stumm, W.M. Snelgrove, C. Cojocaru, R. McKenzie, Computational RAM:
implementing processors in memory. IEEE Des. Test Comput. 16(1), 32–41 (1999)
13. W. Kang, Z. Wang, Y. Zhang, J.O. Klein, W. Lv, W. Zhao, Spintronic logic design methodology
based on spin Hall effect-driven magnetic tunnel junctions. J. Phys. D Appl. Phys. 49(6), 065008
(2016)
14. D. Fan, S. Angizi, Z. He, In-memory computing with spintronic devices, in 2017 IEEE
Computer Society Annual Symposium on VLSI (2017), pp. 683–688
15. W. Kang, C. Zheng, Y. Zhang, D. Ravelosona, W. Lv, W. Zhao, Complementary spintronic
logic with spin Hall effect-driven magnetic tunnel junction. IEEE Trans. Magn. 51(11), 1–4
(2015)
16. P.E. Gaillardon, L. Amaru, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, G.D. Micheli,
The programmable logic-in-memory (PLiM) computer, in IEEE Design, Automation and Test
in Europe Conference and Exhibition (2016), pp. 427–432
17. R. Nair, S.F. Antao, C. Bertolli, P. Bose, J.R. Brunheroto, T. Chen, C.-Y. Cher, C.H.A. Costa,
J. Doi, C. Evangelinos, B.M. Fleischer, T.W. Fox, D.S. Gallo, L. Grinberg, J.A. Gunnels, A.C.
Jacob, P. Jacob, H.M. Jacobson, T. Karkhanis, C. Kim, J.H. Moreno, J.K. O’Brien, M. Ohmacht,
Y. Park, D.A. Prener, B.S. Rosenburg, K.D. Ryu, O. Sallenave, M.J. Serrano, P.D.M. Siegl, K.
Sugavanam, Z. Sura, Active memory cube: a processing-in-memory architecture for exascale
systems. IBM J. Res. Dev. 59(2/3), 17:1–17:14 (2015)
18. M. Gao, G. Ayers, C. Kozyrakis, Practical near-data processing for in-memory analytics
frameworks, in 2015 International Conference on Parallel Architecture and Compilation (2015),
pp. 113–124
19. K. Chen, S. Li, N. Muralimanohar, J.H. Ahn, J.B. Brockman, N.P. Jouppi, CACTI-3DD:
architecture-level modeling for 3D die-stacked DRAM main memory, in IEEE Design,
Automation and Test in Europe Conference and Exhibition (2012), pp. 33–38
20. A.F. Farahani, J.H. Ahn, K. Morrow, N.S. Kim, NDA: Near-DRAM acceleration architecture
leveraging commodity DRAM devices and standard memory modules, in 2015 IEEE 21st
International Symposium on High Performance Computer Architecture (2015), pp. 283–295
21. H.-S. Philip Wong, S. Salahuddin, Memory leads the way to better computing. Nat.
Nanotechnol. 10(3), 191–194 (2015)
22. A. Chen, A review of emerging non-volatile memory (NVM) technologies and applications.
Solid-State Electron. 125, 25–38 (2016)
23. J. Borghetti, G.S. Snider, P.J. Kuekes, J.J. Yang, D.R. Stewart, R.S. Williams, Memristive
switches enable stateful logic operations via material implication. Nature 464(7290), 873–876
(2010)
24. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: a novel processing-in-
memory architecture for neural network computation in ReRAM-based main memory. ACM
SIGARCH Comput. Arch. News 44(3), 27–39 (2016)
25. L. Wang, W. Kang, F. Ebrahimi, X. Li, Y. Huang, C. Zhao, K.L. Wang, W. Zhao, Voltage-
controlled magnetic tunnel junctions for processing-in-memory implementation. IEEE Electron
Device Lett. 39(3), 440–443 (2018)
26. N. Locatelli, V. Cros, J. Grollier, Spin-torque building blocks. Nat. Mater. 13(1), 11–20 (2014)
27. H. Zhang, G. Chen, B.C. Ooi, K.-L. Tan, M. Zhang, In-memory big data management and
processing: a survey. IEEE Trans. Knowl. Data Eng. 27(7), 1920–1948 (2015)
28. Z. Wang, S. Joshi, S. Savel’ev, W. Song, R. Midya, Y. Li, M. Rao, P. Yan, S. Asapu, Y. Zhuo,
H. Jiang, P. Lin, C. Li, J.H. Yoon, N.K. Upadhyay, J. Zhang, M. Hu, J.P. Strachan, M. Barnell,
Q. Wu, H. Wu, R.S. Williams, Q. Xia, J.J. Yang, Fully memristive neural networks for pattern
classification with unsupervised learning. Nat. Electron. 1(2), 137–145 (2018)
29. E. Linn, R. Rosezin, S. Tappertzhofen, U. Böttger, R. Waser, Beyond von Neumann—logic
operations in passive crossbar arrays alongside memory operations. Nanotechnology 23(30),
305205 (2012)
30. S. Gao, G. Yang, B. Cui, S. Wang, F. Zeng, C. Song, F. Pan, Realisation of all 16 Boolean logic
functions in a single magnetoresistance memory cell. Nanoscale 8(25), 12819–12825 (2016)
31. W. Zhao, E. Belhaire, C. Chappert, P. Mazoyer, Spin transfer torque (STT)-MRAM-based
runtime reconfiguration FPGA circuit. ACM Trans. Embed. Comput. Syst. 9(2), 14:1–14:16
(2009)
32. C.J. Lin, S.H. Kang, Y.J. Wang, K. Lee, X. Zhu, W.C. Chen, X. Li, W.N. Hsu, Y.C. Kao, M.T.
Liu, W.C. Chen, Y. Lin, M. Nowak, N. Yu, L. Tran, 45 nm low power CMOS logic compatible
embedded STT MRAM utilizing a reverse-connection 1T/1MTJ Cell, in IEEE International
Electron Devices Meeting (2009), pp. 1–4
33. E. Deng, Design and development of low-power and reliable logic circuits based on spin-
transfer torque magnetic tunnel junctions, Ph.D. dissertation, Grenoble Alpes University,
Grenoble, France (2017)
34. Y. Gang, W. Zhao, J.-O. Klein, C. Chappert, P. Mazoyer, A high-reliability, low-power magnetic
full adder. IEEE Trans. Magn. 47(11), 4611–4616 (2011)
35. E. Deng, Y. Zhang, W. Kang, B. Dieny, J.-O. Klein, G. Prenat, W. Zhao, Synchronous 8-bit
non-volatile full-adder based on spin transfer torque magnetic tunnel junction. IEEE Trans.
Circuits Syst. I Regul. Pap. 62(7), 1757–1765 (2015)
36. A. Mochizuki, H. Kimura, M. Ibuki, T. Hanyu, TMR-based logic-in-memory circuit for low-
power VLSI. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E88-A(6), 1408–1415
(2005)
37. W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, P. Mazoyer, New non-volatile logic based on
spin-MTJ. Nanotechnology 205(6), 1373–1377 (2008)
38. S. Onkaraiah, M. Reyboz, F. Clermidy, J.-M. Portal, M. Bocquet, C. Muller, Hraziia, C. Anghel,
A. Amara, Bipolar ReRAM based non-volatile flip-flops for low-power architectures, in IEEE
International New Circuits and Systems Conference (2012), pp. 417–420
39. D. Chabi, W. Zhao, E. Deng, Y. Zhang, N.B. Romdhane, J.-O. Klein, C. Chappert, Ultra low
power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms.
IEEE Trans. Circuits Syst. I Regul. Pap. 61(6), 1755–1765 (2014)
40. W. Zhao, M. Moreau, E. Deng, Y. Zhang, J.-M. Portal, J.-O. Klein, M. Bocquet, H. Aziza,
D. Deleruyelle, C. Muller, D. Querlioz, N.B. Romdhane, D. Ravelosona, C. Chappert,
Synchronous non-volatile logic gate design based on resistive switching memories. IEEE Trans.
Circuits Syst. I Regul. Pap. 61(2), 443–454 (2014)
41. E. Deng, Y. Zhang, J.-O. Klein, D. Ravelosona, C. Chappert, W. Zhao, Low power magnetic
full-adder based on spin transfer torque MRAM. IEEE Trans. Magn. 49(9), 4982–4987 (2013)
42. S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, H. Hasegawa, T. Endoh, H. Ohno, T. Hanyu,
Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic
tunnel junctions. Appl. Phys. Express 1(9), 091301 (2008)
43. T. Hanyu, T. Endoh, D. Suzuki, H. Koike, Y. Ma, N. Onizawa, M. Natsui, S. Ikeda, H. Ohno,
Standby-power-free integrated circuits using MTJ-based VLSI computing. Proc. IEEE 104(10),
1844–1863 (2016)
44. W. Kang, H. Wang, Z. Wang, Y. Zhang, W. Zhao, In-memory processing paradigm for bitwise
logic operations in STT-MRAM. IEEE Trans. Magn. 53(11), 6202404 (2017)
45. S. Jain, A. Ranjan, K. Roy, A. Raghunathan, Computing in memory with spin-transfer torque
magnetic RAM. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(3), 470–483 (2018)
46. Z. He, S. Angizi, D. Fan, Exploring STT-MRAM based in-memory computing paradigm with
application of image edge extraction, in IEEE International Conference on Computer Design
(2017), pp. 439–446
47. Z. He, S. Angizi, F. Parveen, D. Fan, High performance and energy-efficient in-memory
computing architecture based on SOT-MRAM, in IEEE/ACM International Symposium on Nanoscale
(2017), pp. 97–102
48. D. Fan, Z. He, S. Angizi, Leveraging spintronic devices for ultra-low power in-memory
computing logic and neural network, in 2017 IEEE 60th International Midwest Symposium on
Circuits and Systems (2017), pp. 1109–1112
49. H. Zhang, W. Kang, L. Wang, K.L. Wang, W. Zhao, Stateful reconfigurable logic via a single
voltage-gated spin Hall effect driven magnetic tunnel junction in a spintronic memory. IEEE
Trans. Electron Devices 64(10), 4295–4301 (2017)
50. W. Kang, H. Zhang, P. Ouyang, Y. Zhang, W. Zhao, Programmable stateful in-memory
computing paradigm via a single resistive device, in IEEE International Conference on Computer
Design (2017), pp. 613–616
51. R.A. Buhrman, D.C. Ralph, C.-F. Pai, L. Liu, Electrically gated three-terminal circuits and
devices based on spin Hall torque effects in magnetic nanostructures apparatus, methods and
applications, U.S. Patent, no. US9230626B2, March 2016
52. H. Yoda, N. Shimomura, Y. Ohsawa, S. Shirotori, Y. Kato, T. Inokuchi, Y. Kamiguchi,
B. Altansargai, Y. Saito, K. Koi, H. Sugiyama, S. Oikawa, M. Shimizu, M. Ishikawa, K.
Ikegami, A. Kurobe, Voltage-control spintronics memory (VoCSM) having potentials of ultra-
low energy-consumption and high-density, in IEEE International Electron Devices Meeting
(2016), pp. 27.6.1–27.6.4
53. J.E. Hirsch, Spin Hall effect. Phys. Rev. Lett. 83(9), 1834–1837 (1999)
54. L. Liu, C.F. Pai, Y. Li, H.W. Tseng, D.C. Ralph, R.A. Buhrman, Spin-torque switching with
the giant spin Hall effect of tantalum. Science 336(6081), 555–558 (2012)
55. W.G. Wang, M. Li, S. Hageman, C.L. Chien, Electric-field-assisted switching in magnetic
tunnel junctions. Nat. Mater. 11(1), 64–68 (2012)
56. W. Kang, Y. Ran, Y. Zhang, W. Lv, W. Zhao, Modeling and exploration of the voltage
controlled magnetic anisotropy effect for the next-generation low-power and high-speed MRAM
applications. IEEE Trans. Nanotechnol. 16(3), 387–395 (2017)
57. H. Zhang, W. Kang, K. Cao, B. Wu, Y. Zhang, W. Zhao, Spintronic processing unit in spin
transfer torque magnetic random access memory. IEEE Trans. Electron Devices 66(4), 2017–2022
(2019)