Applications of Emerging Memory Technology - Beyond Storage

A collection of papers

Springer Series in Advanced Microelectronics 63

Manan Suri Editor

Applications of
Emerging Memory
Technology
Beyond Storage
Springer Series in Advanced Microelectronics

Volume 63

Series Editors
Kukjin Chun, Department of Electrical and Computer Engineering, Seoul National
University, Seoul, Korea (Republic of)
Kiyoo Itoh, Hitachi Ltd., Tokyo, Japan
Thomas H. Lee, Department of Electrical Engineering CIS-205, Stanford
University, Stanford, CA, USA
Rino Micheloni, Torre Sequoia, II piano, PMC-Sierra, Vimercate (MB), Italy
Takayasu Sakurai, The University of Tokyo, Tokyo, Japan
Willy M. C. Sansen, ESAT-MICAS, Katholieke Universiteit Leuven, Leuven,
Belgium
Doris Schmitt-Landsiedel, Lehrstuhl für Technische Elektronik, Technische
Universität München, Munich, Germany
The Springer Series in Advanced Microelectronics provides systematic information
on all the topics relevant for the design, processing, and manufacturing of
microelectronic devices. The books, each prepared by leading researchers or
engineers in their fields, cover the basic and advanced aspects of topics such as
wafer processing, materials, device design, device technologies, circuit design,
VLSI implementation, and sub-system technology. The series forms a bridge
between physics and engineering, therefore the volumes will appeal to practicing
engineers as well as research scientists.

More information about this series at http://www.springer.com/series/4076


Editor
Manan Suri
Department of Electrical Engineering
Indian Institute of Technology Delhi
New Delhi, Delhi, India

ISSN 1437-0387    ISSN 2197-6643 (electronic)
Springer Series in Advanced Microelectronics
ISBN 978-981-13-8378-6    ISBN 978-981-13-8379-3 (eBook)
https://doi.org/10.1007/978-981-13-8379-3
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
He who sees inaction in action, and action
in inaction
Is Spiritually wise, transcendentally situated
a perfect performer of all actions
(Shrimad Bhagwad Gita, Chapter 4, Verse 18)
In loving memory of
Harbans Kaur, Gyan Chand, Raj Rani, and
Jagdish Chander
Preface

Let me try to keep this Preface short and simple so that readers can save time for
actual technical content.
If data is the question, memory is the answer!
Over the last few decades, the amount of memory, memory-related devices, and
circuits on most silicon dies has increased manifold, and it will increase further in
the time to come. This leads us to the question: if the presence of memory is becoming
more and more pronounced, why not exploit it for multiple applications beyond
simple conventional storage? The emergence of different flavors of new memory
materials and devices with diverse underlying physics has opened many new
application opportunities. The contributions in this edition are an effort to
showcase applications beyond simple 1/0 storage that can be realized
using emerging nanoscale, non-volatile memory devices, materials, and circuits.
The present volume is a work in progress, and we hope to improve it further with
your feedback. The book in its current form may be used as a research reference
text as well as reading material for advanced courses. I would like to express deep
gratitude to all the contributing researchers and their teams for presenting excellent
technical content for this project, and also to the project co-ordinator for the sincere
efforts that made this edition possible.

New Delhi, India Manan Suri

Contents

1 Towards Spintronics Nonvolatile Caches
  Zhaohao Wang, Bi Wu, Chao Wang, Wang Kang and Weisheng Zhao
2 CMOS-OxRAM Based Hybrid Nonvolatile SRAM and Flip-Flop: Circuit Implementations
  Swatilekha Majumdar, Sandeep Kaur Kingra and Manan Suri
3 Phase Change Memory for Physical Unclonable Functions
  Nafisa Noor and Helena Silva
4 Applications of Resistive Switching Memory as Hardware Security Primitive
  Roberto Carboni and Daniele Ielmini
5 Memristive Biosensors for Ultrasensitive Diagnostics and Therapeutics
  Ioulia Tzouvadaki, Giovanni De Micheli and Sandro Carrara
6 Optimized Programming for STT-MTJ-Based TCAM for Low-Energy Approximate Computing
  Ashwani Kumar and Manan Suri
7 Greedy Edge-Wise Training of Resistive Switch Arrays
  Doo Seok Jeong
8 mMPU—A Real Processing-in-Memory Architecture to Combat the von Neumann Bottleneck
  Nishil Talati, Rotem Ben-Hur, Nimrod Wald, Ameer Haj-Ali, John Reuben and Shahar Kvatinsky
9 Spintronic Logic-in-Memory Paradigms and Implementations
  Wang Kang, Erya Deng, Zhaohao Wang and Weisheng Zhao
Editor, Project Co-ordinator, and Contributors

About the Editor

Manan Suri leads the NVM and Neuromorphic Research Group at IIT-Delhi. He is a
Professor at the Department of Electrical Engineering, IIT-Delhi and the
Founder of CYRAN AI Solutions. His research interests
include Semiconductor Non-Volatile Memory
(NVM) Technology and its Advanced Applications
(Neuromorphic, AI, Security, Computing, Sensing, etc.).
He has been globally recognized as a leading DeepTech
Innovator. He was selected by MIT Technology Review
as one of the Top 35 Global Innovators Under 35
(MIT-TR 35 Global List) and one of the Top 10 Indian
Innovators Under 35 (MIT-TR 35 India List). He received
the prestigious IEEE EDS Early Career Award (2018), the
Young Scientist Award (2017) from The National
Academy of Sciences, the Young Engineers Award (2016)
from The Institution of Engineers, and the Laureat du Prix
(2014) from the French Nanosciences Foundation. He has
filed multiple patents, authored 65+ publications and
delivered 45+ invited talks. Dr. Suri is a visiting scientist
at CNRS France. He serves as an advisor to leading
AI/Neuromorphic/NVM hardware companies and government
bodies. Prior to joining IIT-Delhi, he worked
at NXP Semiconductors, Belgium and CEA-LETI, France.
Dr. Suri received his Ph.D. from INP-Grenoble, France
and Masters/Bachelors from Cornell University, USA.
email: [email protected]
http://web.iitd.ac.in/~manansuri/


Project Co-ordinator

Ms. Sandeep Kaur Kingra NVM and Neuromorphic


Research Group, Department of Electrical Engineering,
Indian Institute of Technology Delhi

Contributors

Rotem Ben-Hur received her B.Sc. in electrical
engineering from the Technion - Israel Institute of
Technology, in 2014. In 2012, she joined Elbit Systems
as an FPGA designer. Since 2015, she has been a graduate student
working toward a Ph.D. (direct path) at the Andrew and
Erna Viterbi Faculty of Electrical Engineering,
Technion - Israel Institute of Technology. Her current
research is focused on novel architectures for logic with
emerging memory technologies.

Roberto Carboni received his B.S. and M.S. in


electrical engineering from Politecnico di Milano,
Milan, Italy, in 2013 and 2016, respectively, where
he is currently pursuing his Ph.D. in electrical engi-
neering. His main research interests are the character-
ization and modeling of resistive switching memory
(RRAM) and spin-transfer torque magnetic memory
(STT-MRAM) for memory and computing applica-
tions.

Sandro Carrara is IEEE Fellow and also the recipient


of the IEEE Sensors Council Technical Achievement
Award. He is Faculty at EPFL, Lausanne, Switzerland,
and former Professor at the Universities of Genoa and
Bologna, Italy. He holds a Ph.D. in biochemistry and
biophysics, a Master's in physics, and a diploma in
electronics. His scientific interests are on electrical
phenomena of nano-biostructured films and include
CMOS design of biochips based on proteins and DNA.
Along his career, he published seven books, one as
author with Springer on Bio/CMOS interfaces and, more
recently, a Handbook of Bioelectronics with Cambridge
University Press. He has more than 250 scientific
publications and is author of 13 patents. He is now
Editor-in-Chief of the IEEE Sensors Journal, Founder
and Editor-in-Chief of the journal BioNanoScience by
Springer, and Associate Editor of IEEE Transactions on
Biomedical Circuits and Systems. He is Member of the
IEEE Sensors Council and was Member of the Board of
Governors (BoG) of the IEEE CAS Society. He has been
appointed two times as IEEE Distinguished Lecturer. His
work received several international recognitions as
best-cited papers and best conference papers. He has
been General Chairman of the Conference IEEE BioCAS
2014.

Giovanni De Micheli is Professor and Director of the


Institute of Electrical Engineering at EPFL, Lausanne,
Switzerland. He is Fellow of ACM and IEEE, Member
of the Academia Europaea, and International Honorary
Member of the American Academy of Arts and Sciences.
His research interests include several aspects of design
technologies for integrated circuits and systems, such as
synthesis for emerging technologies, networks on chips,
and 3D integration. His citation h-index is 93 according
to Google Scholar. He is a member of the Scientific
Advisory Board of IMEC (Leuven, B), CfAED
(Dresden, D), and STMicroelectronics. He is the recip-
ient of the 2016 IEEE/CS Harry Goode Award for
seminal contributions to design and design tools of
networks on chips, the 2016 EDAA Lifetime
Achievement Award, the 2012 IEEE/CAS Mac Van
Valkenburg Award for contributions to theory, practice,
and experimentation in design methods and tools, and the
2003 IEEE Emanuel Piore Award for contributions to
computer-aided synthesis of digital systems. He also received
the D. Pederson Award for the best paper in the
IEEE Transactions on CAD in 2018 and 1987, as well as
several Best Paper Awards. He has been serving IEEE in
several capacities, namely Division 1 Director (2008–
2009), Co-founder and President Elect of the IEEE
Council on EDA (2005–2007), President of the
IEEE CAS Society (2003), Editor-in-Chief of the IEEE
Transactions on CAD/ICAS (1997–2001). He has been
Chair of several conferences, including MEMOCODE
(2014), DATE (2010), pHealth (2006), VLSI-SOC
(2006), DAC (2000), and ICCD (1989).

Erya Deng was born in China, 1989. She received the


Ph.D. degree in nano-electronics and nano-technologies
from the University of Grenoble Alpes, France, in
2017. She received the M.S. degree in electronics from
the University of Paris-Sud, France, in 2013. Her
interest includes hybrid CMOS/magnetic circuits for
memory and logic applications.

Ameer Haj-Ali is currently a Ph.D. student in the


Department of Electrical Engineering and Computer
Science, UC Berkeley. He completed his M.Sc. studies
at the Andrew and Erna Viterbi Faculty of Electrical
Engineering at the Technion – Israel Institute of
Technology in 2018. He received the B.Sc. in computer
engineering, summa cum laude, in 2017, from the
Technion - Israel Institute of Technology. From 2015 to
2016, he was with Mellanox Technologies as a chip
designer. His current research is focused on hardware/
software co-design, auto-tuning, machine learning, rein-
forcement learning, ASIC design, high-performance
computing, and hardware for machine learning.

Daniele Ielmini is Full Professor at the Dipartimento di


Elettronica, Informazione, e Bioingegneria, Politecnico
di Milano. He conducts research on emerging
nano-electronic devices, such as phase-change memory
(PCM) and resistive switching memory (RRAM), and
their application in computing. He received the Intel
Outstanding Researcher Award in 2013, the ERC
Consolidator Grant in 2014, and the IEEE EDS
Rappaport Award in 2015.

Doo Seok Jeong received his B.E. and M.E. in


materials science from Seoul National University, in
2002 and 2005, respectively, and the Ph.D. in materials
science from RWTH Aachen, Germany, in 2008. He was
with the Korea Institute of Science and Technology, from
2008 to 2018. Since 2018, he has been an associate
professor with Hanyang University. His research interest
includes digital implementation of fully reconfigurable
spiking neural networks with embedded learning algo-
rithms of the locality. New learning algorithms suitable
for digital neuromorphic hardware are of particular
interest for the moment. He has authored/co-authored
more than 90 papers that have been cited more than 5800
times.

Wang Kang (S’12, M’15) received the B.S. in electronic
and information engineering from Beihang
University, Beijing, China, in 2009. In 2014, he received a
double Ph.D.: in microelectronics from Beihang
University, Beijing, China, and in physics from the
University of Paris-Sud, Paris, France. He is now
Associate Professor at the School of
Microelectronics, Beihang University, Beijing, China.
His research interests include spintronic devices, circuits,
architectures, and applications. He has authored or
co-authored two chapters, more than 90 technical papers,
and over 20 Chinese patents. He served as Guest Editor
of Microelectronics Journal.

Sandeep Kaur Kingra is currently pursuing Ph.D.


from the Department of Electrical Engineering at Indian
Institute of Technology Delhi, India. She received her
B.Tech. in electronics and communication engineering
and M.Tech. in microelectronics in 2011 and 2015,
respectively. Her current areas of interest are emerging
non-volatile memories, characterization, and computing
applications of emerging memories.

Ashwani Kumar received his B.Tech. in electronics


and communication engineering and M.Tech. in
micro-electronics in 2010 and 2013, respectively. He
is currently working toward the Ph.D. in Electrical
Engineering Department with Indian Institute of
Technology Delhi (IITD), India. His current research
interests include emerging memristive technology for
imaging applications.

Shahar Kvatinsky is Assistant Professor at the Andrew


and Erna Viterbi Faculty of Electrical Engineering,
Technion - Israel Institute of Technology. He received
his B.Sc. in computer engineering and applied physics
and MBA in 2009 and 2010, respectively, both from the
Hebrew University of Jerusalem, and his Ph.D. in
electrical engineering from the Technion - Israel
Institute of Technology in 2014. From 2006 to 2009,
he was with Intel as Circuit Designer and was
Post-Doctoral Research Fellow at Stanford University
from 2014 to 2015. He is Editor of Microelectronics
Journal and has been the recipient of 2015 IEEE
Guillemin-Cauer Best Paper Award, 2015 Best Paper
of Computer Architecture Letters, Viterbi Fellowship,
Jacobs Fellowship, ERC Starting Grant, 2017 Pazy
Memorial Award, 2014 and 2017 Hershel Rich Technion

Innovation Awards, 2013 Sanford Kaplan Prize for


Creative Management in High Tech, 2010 Benin Prize,
and seven Technion excellence teaching awards. His
current research is focused on circuits and architectures
with emerging memory technologies and design of
energy-efficient architectures.

Swatilekha Majumdar is currently pursuing Ph.D.


from Indian Institute of Technology Delhi, India. She
received her M.Tech. from IIIT, Delhi, in VLSI and
embedded systems in 2014 and B.Tech. from IP
University, Delhi, in electronics and communication
in 2011. She visited National Chiao Tung University,
Taiwan, in 2017 as PhD exchange student, and has
worked with ST Microelectronics from 2013–2014. Her
research interests include NVSRAM applications. She
was conferred with IEEE Student Fellowship at the
32nd IEEE VLSID Conference and has been associated
with IEEE WIE since 2019.

Nafisa Noor received her B.S. in electrical and


electronic engineering from Bangladesh University of
Engineering and Technology (BUET), Dhaka,
Bangladesh, in 2007. She started working as System
Engineer at a leading telecommunication operator
company, Grameenphone Ltd., Dhaka, Bangladesh,
from June 2007. She joined the Department of
Electrical and Electronic Engineering at Ahsanullah
University of Science and Technology (AUST) in
Dhaka, Bangladesh, as Lecturer in October 2008. She is
currently pursuing her Ph.D. in electrical engineering at
University of Connecticut, Storrs, CT, USA.

John Reuben received B.E. (Hon’s) from BITS,


Pilani, in 2004 and Master’s and Ph.D. from VIT
University, India, in 2008 and 2015, respectively. He
was Post-Doctoral Researcher in Technion, Israel, from
January 2017 to January 2018. He is currently working
as Post-Doctoral Researcher in Friedrich Alexander
University, Erlangen, Germany. His research interests
are RRAMs, memristive logic, and beyond-CMOS
computing.

Helena Silva received her B.Eng. in engineering


physics from Universidade Técnica de Lisboa, Lisboa,
Portugal, in 1998, and Ph.D. in applied physics from
Cornell University, Ithaca, NY, USA, in 2005. She is
currently Associate Professor in the Department of
Electrical and Computer Engineering at University of
Connecticut, Storrs, CT, USA.

Nishil Talati is currently working toward the Ph.D. in


the Computer Science and Engineering Department,
University of Michigan, Ann Arbor. His current
research interests include computer architecture, main
memory systems, and emerging memory technologies.
He received his B.Eng. in electrical engineering from
BITS, Pilani, India, in 2016, and M.Sc. in electrical
engineering from the Technion - Israel Institute of
Technology in 2018.
e-mail: [email protected]

Ioulia Tzouvadaki received her B.Sc. in physics from


National and Kapodistrian University of Athens
(UoA) and M.Sc. in microsystems and nano-devices
from National Technical University of Athens (NTUA).
Her M.Sc. thesis concerned the computational study
and simulation of polymer nano-composite materials,
within the Computational Materials Science and
Engineering (CoMSE) research group of the School
of Chemical Engineering at NTUA. She received her
Ph.D. in microsystems and microelectronics at École
Polytechnique Fédérale de Lausanne (EPFL). In her
Ph.D. research at the Integrated System Laboratory
(LSI), she focused on the fabrication and characteriza-
tion of nano-structures and their implementation as
ultrasensitive nano-biosensors in both diagnostics and
therapeutics. She joined Stanford University as
Post-Doctoral Fellow working on the design of an
electronic platform for integration with wearable sweat
biomarker sensors for multi-panel, continuous moni-
toring to enhance human health and performance.
Currently, she is Research Fellow at Southampton
University.

Nimrod Wald received his B.Sc. in electrical engi-


neering and physics in 2013, and his M.Sc. in electrical
engineering in 2019, both from the Technion - Israel
Institute of Technology, Haifa. Between 2011 and
2016, he was with Qualcomm Inc. in a hardware design
position, and later in a hardware architecture position in
the area of performance analysis. Currently, he is with a
start-up company in the field of EDA for hardware
development.

Chao Wang received his B.S. in Electronics and


Information Engineering from Beihang University,
Beijing, China, in 2018, where he is currently pursuing
his M.S. with the Department of Microelectronics. His
research interests include the modeling of non-volatile
nano-devices, the design of new non-volatile memories
and logic circuits, and the optimization issue of the
spintronics memory architectures.

Zhaohao Wang (S’12–M’16) received his B.S. in


microelectronics from Tianjin University, China, in
2009, M.S. in microelectronics from Beihang Univer-
sity, China, in 2012, and Ph.D. in physics from University
Paris-Saclay, France, in 2015. He is currently Assistant
Professor at School of Microelectronics, Beihang
University, China. His current research interests include
the modeling of non-volatile nano-devices and the design
of new non-volatile memories and logic circuits. He has
authored or co-authored more than 50 technical papers and
holds more than 10 Chinese patents.

Bi Wu received his B.S. and M.S. from China


University of Mining and Technology, Xuzhou,
China, and Beihang University, Beijing, China, respec-
tively. He is currently pursuing Ph.D. in electrical
engineering at Beihang University. In 2017, he won the
China National Scholarship for doctoral students which
is awarded by Ministry of Education of China. His
research interests include circuit-level and architecture-
level design and optimization of STT-MRAM,
SOT-MRAM, the corresponding reliability analysis
and improvement, etc.

Weisheng Zhao (M’07–SM’14-F’18) received his


Ph.D. in physics from the University of Paris-Sud,
Paris, France, in 2007. He was Research Associate with
the CEA’s Embedded Computing Laboratory, France,
from 2007 to 2009. From 2009 to 2014, he was Tenured
Scientist with CNRS, France. He is currently
Distinguished Professor and Director of the Fert
Beijing Research Institute, Beihang University,
Beijing, China. He has authored or co-authored two
books, more than 200 scientific papers, and holds four
international patents and more than 30 Chinese patents.
His research focused on the hybrid integration of
emerging nano-devices (spintronics, nanotube devices,
and memristors) with complementary metal-oxide-
semiconductor circuits toward logic and memory appli-
cations. He is Editorial Board Member of Scientific
Reports, and he is also Associate Editor of the IEEE
Transactions on Nano-technology and IET Electronics
Letters.
Chapter 1
Towards Spintronics Nonvolatile Caches

Zhaohao Wang, Bi Wu, Chao Wang, Wang Kang and Weisheng Zhao

Abstract Non-volatile (NV) cache is desired for overcoming the power and speed
bottlenecks of modern static random access memory (SRAM). A promising
candidate for constructing the NV cache is the spin transfer torque magnetic RAM
(STT-MRAM), which features low power, high speed, high density and nearly
unlimited endurance. In this chapter, we review the efforts made to realize
the STT-MRAM based NV cache, ranging from the architecture level to the device
level. In addition, the application potential of emerging spintronics technologies,
such as spin orbit torque (SOT) and voltage-controlled magnetic anisotropy (VCMA),
is discussed in terms of their benefits and challenges.

Z. Wang (✉) · C. Wang · W. Kang · W. Zhao
School of Microelectronics, Beijing Advanced Innovation Center for Big Data and Brain
Computing (BDBC), Fert Beijing Research Institute, Beihang University, Beijing 100191, China
e-mail: [email protected]
C. Wang
e-mail: [email protected]
W. Kang
e-mail: [email protected]
W. Zhao
e-mail: [email protected]
B. Wu
School of Electronics and Information Engineering, Fert Beijing Research Institute, Beihang
University, Beijing 100191, China
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_1

1.1 Introduction

In a computing system, processors and memories are the key modules which,
respectively, perform arithmetic operations and store data/instructions. Therefore, the
computing efficiency depends strongly on both the execution speed of the processor
and the access speed of the memory. Unfortunately, these two speeds are often poorly
matched in a typical computing architecture. Generally, accessing the memories takes
much longer than processing the instructions in the processors. As a result, the
performance of a computing system is mainly determined by the memory bandwidth
rather than the processor frequency. This issue is known as the “memory wall” in
modern computers. For instance, among state-of-the-art technologies, the base frequency
of an Intel Core i7 processor can be as high as 3.70 GHz, whereas the data rate of a
Samsung DDR3 dynamic random access memory (DRAM) is 1600 Mbps. To overcome the
“memory wall”, modern computers employ the memory hierarchy shown in Fig. 1.1a, where
various types of memories are organized at different levels according to their capacity
and speed. The most frequently accessed data or instructions are copied into several
high-speed memories, which are embedded into or placed very close to the processors
(e.g., the CPU in Fig. 1.1a). These memories are called caches, and they efficiently
reduce the speed gap between the processor and the main memory. Furthermore, with the
rise of the multi-core processor, the computing efficiency is improved, while the cache
capacity needs to be increased to accommodate more data and instructions. Similar
to Fig. 1.1a, caches are also organized as a hierarchy of multiple levels including
L1–L2 and a shared L3 (last level cache, or LLC), as shown in Fig. 1.1b. The L1 cache
requires an access speed as fast as possible. By contrast, in an LLC a large capacity
is desirable and a slower speed is tolerated.

Fig. 1.1 a Typical memory hierarchy of a modern computer: from top to bottom, CPU
registers, cache and main memory (volatile), then hard disk drive / solid state drive
(non-volatile); speed and cost increase towards the top, capacity towards the bottom.
b Typical cache hierarchy of an 8-core CPU (Central Processing Unit): each core has a
pipeline, L1-I and L1-D caches and an L2 cache, and all cores share an L3 cache (LLC)
connected to the main memory (DRAM) [1]
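The benefit of the hierarchy can be made concrete with the standard average memory access time (AMAT) recurrence, in which the miss penalty of one level is the average access time of the level below it. The following is a minimal sketch only: the function name, the hit latencies and the miss rates are hypothetical values chosen for illustration and are not taken from this chapter.

```python
def average_access_time(levels, memory_latency_ns):
    """Average memory access time (AMAT) for a multi-level cache hierarchy.

    levels: list of (hit_latency_ns, miss_rate) tuples ordered from L1 to LLC.
    memory_latency_ns: latency of main memory, paid when every cache level misses.
    """
    # Work backwards from main memory: the penalty for missing at one level
    # is the average access time of the level below it.
    penalty = memory_latency_ns
    for hit_latency_ns, miss_rate in reversed(levels):
        penalty = hit_latency_ns + miss_rate * penalty
    return penalty


# Hypothetical latencies and local miss rates, chosen only for illustration.
hierarchy = [
    (1.0, 0.10),   # L1: 1 ns hit, 10% of accesses miss
    (4.0, 0.30),   # L2: 4 ns hit, 30% of L1 misses also miss here
    (20.0, 0.20),  # L3 / LLC: 20 ns hit, 20% of L2 misses go to DRAM
]
dram_latency_ns = 100.0

with_caches = average_access_time(hierarchy, dram_latency_ns)
without_caches = average_access_time([], dram_latency_ns)
print(f"AMAT with caches:    {with_caches:.2f} ns")
print(f"AMAT without caches: {without_caches:.2f} ns")
```

Under these assumed numbers the small, fast L1 dominates the average access time, which is why L1 is optimized for speed while the LLC can trade speed for capacity.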
As mentioned above, the cache needs to be faster than the main memory.
This difference can be explained by the bit-cell structures of a cache and a main
memory shown in Fig. 1.2. The cache and the main memory are constructed with
static random access memory (SRAM) and DRAM, respectively. The SRAM bit-cell
consists of six transistors. One bit of data is stored in two cross-coupled
inverters (M1–M4) and is read or written through two controlling transistors (M5–M6).
The DRAM bit-cell is composed of an access transistor connected to a capacitor. The
read and write operations are performed by discharging and charging the
capacitor. Since the charge on the capacitor needs to be maintained by a periodic
refresh, the DRAM is accessed more slowly than the SRAM.

Fig. 1.2 Schematic bit-cell structures of the conventional a SRAM (labels: WL, VDD,
M1–M6, Q, BL) and b DRAM (labels: word line, bit line)
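The need for a periodic refresh can be illustrated with a toy behavioral model of the DRAM bit-cell. This is a sketch under simplifying assumptions, not a circuit model: the exponential leakage law, the 100 ms time constant, the sensing threshold and all names are hypothetical.

```python
import math


class DramCell:
    """Toy 1T1C cell: the stored charge leaks away exponentially over time."""

    def __init__(self, leak_time_constant_ms=100.0, sense_threshold=0.5):
        self.tau_ms = leak_time_constant_ms  # hypothetical leakage time constant
        self.threshold = sense_threshold     # fraction of full charge read as '1'
        self.charge = 0.0                    # normalized capacitor charge (0..1)

    def write(self, bit):
        # Writing charges (bit = 1) or discharges (bit = 0) the capacitor.
        self.charge = 1.0 if bit else 0.0

    def wait(self, elapsed_ms):
        # Leakage: the charge decays towards zero while the cell is idle.
        self.charge *= math.exp(-elapsed_ms / self.tau_ms)

    def read(self):
        # Sense the remaining charge, then write the sensed value back,
        # because reading a DRAM cell disturbs the stored charge.
        bit = 1 if self.charge > self.threshold else 0
        self.write(bit)
        return bit

    def refresh(self):
        # A refresh is a read followed by the write-back of the sensed value.
        self.read()


def retained_bit(total_ms, refresh_interval_ms=None):
    """Store a '1', wait total_ms (optionally refreshing), and read it back."""
    cell = DramCell()
    cell.write(1)
    if refresh_interval_ms is None:
        cell.wait(total_ms)
    else:
        elapsed = 0.0
        while elapsed < total_ms:
            cell.wait(refresh_interval_ms)
            cell.refresh()
            elapsed += refresh_interval_ms
    return cell.read()


print("no refresh:   ", retained_bit(200.0))
print("50 ms refresh:", retained_bit(200.0, refresh_interval_ms=50.0))
```

In this model the stored bit survives only if the refresh interval is short enough that the charge is still above the sensing threshold when the refresh arrives. The SRAM latch has no such requirement as long as the power supply stays on.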
Despite its fast access, the SRAM-based cache is not a perfect memory, owing to the
following two issues. First, the SRAM occupies a much larger area than the DRAM
because its bit-cell uses more transistors. In a modern microprocessor, the SRAM-based
caches occupy more than one half of the chip area. Moreover, the capacity of the
SRAM-based cache is very limited compared with the DRAM-based main memory. For
instance, the capacities of the cache and main memory in a ThinkPad-X1 Carbon laptop
are 8 MB and 16 GB, respectively. Second, the SRAM is volatile and consumes
considerable energy. In particular, the leakage current of the transistors cannot be
eliminated since the power supply has to be always on to keep the data. With the
scaling of CMOS technology, leakage has become the major source of chip power
consumption. Especially in a multi-core system, the leakage of the large-capacity LLC
accounts for most of the total power consumption. These bottlenecks
severely impede the sustainable optimization of the SRAM-based cache. Although
the embedded DRAM (eDRAM) has been proposed and used as a large-capacity LLC, it
is still difficult to reduce the power consumption of the eDRAM.
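A back-of-the-envelope model shows why leakage dominates in a large cache. The sketch below splits cache power into a leakage term that scales with capacity and a dynamic term that scales with access rate. Only the 8 MB capacity comes from the example above; the per-MB leakage, access rate and per-access energies are hypothetical, and the non-volatile case idealizes the cell array as leakage-free while ignoring peripheral circuits.

```python
def cache_power_mw(capacity_mb, leak_mw_per_mb, accesses_per_s, energy_per_access_nj):
    """Return (leakage, dynamic) power in mW for a simple cache model."""
    # Leakage grows with capacity and is paid whenever the array is powered.
    leakage_mw = capacity_mb * leak_mw_per_mb
    # Dynamic power is only paid when the cache is actually accessed.
    # nJ per access * accesses per second = nW; divide by 1e6 to get mW.
    dynamic_mw = accesses_per_s * energy_per_access_nj / 1e6
    return leakage_mw, dynamic_mw


# All parameter values below are hypothetical and chosen only for illustration.
sram_leak, sram_dyn = cache_power_mw(8, 50.0, 1e8, 0.5)
# Idealized non-volatile array: no cell leakage, but a higher energy per access.
nv_leak, nv_dyn = cache_power_mw(8, 0.0, 1e8, 1.0)

leak_share = sram_leak / (sram_leak + sram_dyn)
print(f"SRAM LLC: {sram_leak + sram_dyn:.0f} mW total, leakage share {leak_share:.0%}")
print(f"NV LLC:   {nv_leak + nv_dyn:.0f} mW total")
```

With these assumed values the volatile cache spends most of its power simply holding data, so removing cell leakage can lower the total even when each access costs more energy.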
To develop a high-performance cache beyond the SRAM, both academia
and industry are exploring nonvolatile memory (NVM) technologies, which offer
the advantages of high density and low power over the volatile SRAM. In particular,
data can be retained in the NVM cell without a power supply, which promises
nearly zero leakage power consumption. Among the various
NVM technologies, Flash has been widely commercialized for mass
storage (e.g., USB flash drives and solid state drives) [2]. However, Flash is
unsuitable for constructing a cache since it suffers from low write endurance
(∼10^5 cycles) and slow access (microseconds to milliseconds). Phase-change
RAM (PCRAM) shows higher endurance (∼10^9 cycles) than Flash, and its
storage density is sufficiently high [3]. But the write operation of the PCRAM
is achieved through heating and cooling processes, which require a large latency of
hundreds of nanoseconds and cannot satisfy the demands of the cache. Instead,
the PCRAM is widely accepted as a competitive candidate for nonvolatile main
memory. For the ferroelectric RAM (FeRAM), the nonvolatile information is
represented by the ferroelectric polarization [4]. Thus, the device has to be large
enough to store adequate charge and provide detectable signals. As a result, the
storage density of the FeRAM is much lower than that of Flash and PCRAM. In
addition, the read operation of the FeRAM is destructive. These drawbacks
prevent the FeRAM from being used as cache. In sum, it is
challenging to develop an NVM device that meets all the requirements (density,
speed, power, etc.) of the cache. Currently, the relatively promising candidates for
NV-caches are the magnetic RAM (MRAM) [5, 6] and resistive RAM (RRAM)
[7]. However, limited endurance and uncontrollable process variation are the main
bottlenecks of the RRAM. By contrast, the MRAM is more competitive as it offers
almost unlimited endurance and good compatibility with the CMOS process. In particular,
several MRAM chips have shown excellent performance when used as LLC [8, 9].
Table 1.1 summarizes the characteristics of the abovementioned memory devices
[10].

Table 1.1 Characteristics of the mainstream memory devices [10]

                      SRAM    DRAM    NAND-Flash  STT-MRAM  FeRAM   PCRAM   RRAM
Endurance (cycles)    ∞       ∞       10^5        ∞         10^14   10^9    10^9
Read/Write time (ns)  <1      30      100/10^6    2–30      30      10/100  1–100
Density               Low     Medium  High        Medium    Low     High    High
Write power           Medium  Medium  High        Medium    Medium  Medium  Medium
This chapter provides an overview of the MRAM-based NV-cache. Despite the
attractive merits of the MRAM, both its write current and write latency are currently
still higher than those of the SRAM. Careful design considerations are therefore
indispensable for the MRAM cache. Here we present the optimization strategies
for the MRAM cache. The potential of several emerging MRAMs in the application
of NV-cache is evaluated as well.

1.2 MRAM Background

The storage element of the MRAM is the magnetic tunnel junction (MTJ) shown in
Fig. 1.3a. The core structure of an MTJ consists of two ferromagnetic layers separated
by a tunnel barrier. The two ferromagnetic layers are called the pinned layer (or
reference layer) and the free layer, respectively. The magnetization direction of the
pinned layer is fixed, whereas that of the free layer is switchable. Once a bias voltage
is applied to the MTJ, electrons pass through the barrier by quantum tunneling and
form a tunnel current. The magnitude of the tunnel current depends on the relative
magnetization orientation of the pinned layer and free layer. If the magnetization
directions of the two ferromagnetic layers are parallel (P) to each other, a large tunnel
current is induced so that the MTJ is in the low-resistance state. In contrast, the MTJ
is in the high-resistance state if the magnetization orientation of the free layer is
switched to be antiparallel (AP) to that of the pinned layer (see Fig. 1.3b). One bit of
nonvolatile data can thus be represented by the MTJ resistance state.

[Figure 1.3: (a) layer stack with pinned layer, tunnel barrier and free layer; (b) resistance versus magnetic field, showing the AP and P states]

Fig. 1.3 a Schematic structure of the magnetic tunnel junction. b Tunneling magnetoresistance
effect: the MTJ resistance can be switched between high and low values by reversing the
magnetization of the free layer

The tunneling magnetoresistance (TMR) ratio is expressed as follows:

$$\mathrm{TMR} = \frac{R_{AP} - R_{P}}{R_{P}} \times 100\% \qquad (1.1)$$

where R_AP and R_P are the MTJ resistances for the AP and P states, respectively.
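As a quick numerical illustration of (1.1), the following Python sketch computes the TMR
ratio from the two resistance states and, conversely, the AP resistance implied by a given
TMR. The function names and the resistance values are illustrative assumptions, not data
from this chapter.

```python
def tmr_percent(r_ap: float, r_p: float) -> float:
    """TMR ratio of Eq. (1.1) in percent, from the AP and P resistances (ohms)."""
    if r_p <= 0:
        raise ValueError("R_P must be positive")
    return (r_ap - r_p) / r_p * 100.0


def r_ap_from_tmr(r_p: float, tmr_pct: float) -> float:
    """Invert Eq. (1.1): AP resistance for a given P resistance and TMR (%)."""
    return r_p * (1.0 + tmr_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical device: R_P = 5 kOhm, R_AP = 12.5 kOhm.
    print("TMR (%):", tmr_percent(12.5e3, 5.0e3))
    print("R_AP (ohm):", r_ap_from_tmr(5.0e3, 150.0))
```

A larger TMR widens the gap between the two resistance states, which is what the read
circuit ultimately has to distinguish.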
The write operation of the MRAM is achieved by switching the magnetization of the
free layer. In the early stage of MRAM development (around the 2000s), field-induced
magnetization switching (FIMS) was used for data writing. For instance, the first
commercial MRAM product launched by Freescale employed an improved FIMS-like
technology called toggle [11]. But the FIMS-based schemes are criticized for their
high power consumption and poor scalability, so a purely electrical write operation
is strongly desired. In this context, current-induced spin transfer torque (STT) was
proposed for low-power and scalable magnetization switching [12–14]. To generate the
STT, an electric current flowing through the MTJ is spin-polarized by the pinned layer.
When this spin-polarized current interacts with the free layer, a spin torque is
transferred to switch the magnetization due to the conservation of angular momentum.
The direction of the switched magnetization depends on the polarity of the applied
current. So far, the STT has been intensively studied and has become a mainstream
write technology for the MRAM. A number of commercial STT-MRAM products have
been released by Everspin [15]. Some attempts have been made to develop the
STT-MRAM-based cache, which will be discussed in the following sections.
The read operation of the MRAM is implemented by comparing the MTJ resistance
with a reference value. At the circuit level, this comparison is translated into a
voltage or current difference by a sense amplifier [16–18].
For an MTJ, the easy axis of the ferromagnetic layers can be in-plane or perpendicular
(out-of-plane). Early MTJs were fabricated with in-plane anisotropy. But the in-plane
MTJ suffers from two drawbacks. First, the anisotropy originates from the aspect ratio
of the elliptical shape, so with the scaling of the technology node it is increasingly
difficult to maintain a sufficient anisotropy field. Second, for the in-plane STT-MTJ,
the STT has to overcome the demagnetization field, which makes no contribution to the
thermal stability barrier (Δ), as explained by (1.2)–(1.3). Therefore, the perpendicular
MTJ is preferred for the high-performance MRAM [19].

$$\Delta = \frac{\mu_0 M_s H_k V}{2} \qquad (1.2)$$

$$I_{c\_in} \approx \frac{\alpha \gamma \mu_0 e}{\mu_B g} M_s V \left( H_k + \frac{M_s}{2} \right) \qquad (1.3)$$

where Ic_in is the critical STT current of the in-plane anisotropy MTJ, μ0 is the
vacuum permeability, Ms is the saturation magnetization, Hk is the anisotropy field, V
is the volume of the free layer, α is the Gilbert damping constant, γ is the gyromagnetic
ratio, e is the electron charge, μ_B is the Bohr magneton, and g is a device-dependent
factor. It is seen that the term associated with Ms/2 in (1.3) has no counterpart in
(1.2): it raises the critical current without contributing to the thermal
stability barrier.
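To make the role of the Ms/2 term concrete, the Python sketch below evaluates (1.2) and
(1.3) and reports which share of the in-plane critical current is spent on the
demagnetization field. The material parameters in the usage part (Ms, Hk, free layer
size, α, g) are hypothetical round numbers chosen only for illustration; Ms and Hk are
assumed to be given in A/m and γ in rad/(s·T).

```python
import math

MU_0 = 4.0e-7 * math.pi       # vacuum permeability (T*m/A)
E_CHARGE = 1.602176634e-19    # electron charge (C)
MU_B = 9.2740100783e-24       # Bohr magneton (J/T)
GAMMA = 1.76085963e11         # electron gyromagnetic ratio (rad/(s*T))
K_B = 1.380649e-23            # Boltzmann constant (J/K)


def thermal_barrier(ms, hk, volume):
    """Eq. (1.2): barrier energy in joules (Ms, Hk in A/m, V in m^3)."""
    return MU_0 * ms * hk * volume / 2.0


def ic_in_plane(alpha, g, ms, hk, volume):
    """Eq. (1.3): critical STT current (A) of an in-plane MTJ."""
    prefactor = alpha * GAMMA * MU_0 * E_CHARGE / (MU_B * g)
    return prefactor * ms * volume * (hk + ms / 2.0)


def demag_share(ms, hk):
    """Fraction of Ic_in caused by the Ms/2 (demagnetization) term."""
    return (ms / 2.0) / (hk + ms / 2.0)


if __name__ == "__main__":
    # Hypothetical in-plane free layer: 100 nm x 50 nm ellipse, 2 nm thick.
    volume = math.pi / 4.0 * 100e-9 * 50e-9 * 2e-9
    ms, hk, alpha, g = 1.0e6, 1.0e4, 0.01, 0.5
    delta = thermal_barrier(ms, hk, volume)
    print("barrier in units of k_B*T at 300 K:", delta / (K_B * 300.0))
    print("Ic_in (A):", ic_in_plane(alpha, g, ms, hk, volume))
    print("share of Ic_in due to Ms/2:", demag_share(ms, hk))
```

Whenever Ms is much larger than Hk, almost all of the in-plane critical current goes
into the Ms/2 term, which is the drawback described above.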

1.3 STT-MRAM for the Cache Replacement

STT-MRAM is attracting considerable attention from both academia and industry.
The write circuit of the STT-MRAM is easy to design since it only needs
to generate a bipolar current. Moreover, the interfacial perpendicular anisotropy
enables the continuous scaling of the MTJ and thus the decrease of the write current.
A more attractive feature is that the STT-MRAM is well compatible with CMOS
processes. All this progress makes it possible to design an NV-cache with STT-MRAM.
The common bit-cell of the MRAM is composed of an MTJ connected with an
access transistor, which is called the one-transistor one-MTJ (1T1J) bit-cell [20], as
shown in Fig. 1.4. To construct a bit-cell array, the transistor gates within the
same row are connected together to form the word line (WL). The transistor sources
within the same column are connected together to form the source line (SL). In the
same way, one terminal of each MTJ in a column is connected to the bit line (BL). To
write a bit-cell, the WL is asserted and a voltage is applied across the BL and SL of the
accessed column. Then a current flows through the accessed MTJ to write the data through
the STT. The current polarity is determined by the data to be written. To read a bit-cell,
the WL is asserted and a small voltage is applied to the BL of the accessed column. A
sense amplifier outputs the stored data represented by the MTJ resistance.
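The access sequence described above can be mimicked with a small behavioral model. The
Python sketch below is only an illustration: the class name, the resistance values, the
midpoint reference and the mapping of current direction to data value are assumptions,
and the bit encoding ('1' for the P state, '0' for the AP state) follows the convention
used later in this chapter.

```python
class MramArray:
    """Behavioral model of a 1T1J array; electrical values are placeholders."""

    def __init__(self, rows, cols, r_p=5.0e3, r_ap=12.5e3):
        self.r_p, self.r_ap = r_p, r_ap
        self.r_ref = (r_p + r_ap) / 2.0  # midpoint reference (assumed)
        # Every MTJ starts in the parallel (P) state.
        self.state = [["P"] * cols for _ in range(rows)]

    def write(self, row, col, bit):
        """Assert WL[row], drive BL[col]/SL[col]; return the current direction."""
        # Assumed convention: '1' -> P state, '0' -> AP state.
        direction = "BL->SL" if bit else "SL->BL"
        self.state[row][col] = "P" if bit else "AP"
        return direction

    def read(self, row, col):
        """Assert WL[row], bias BL[col]; compare the MTJ resistance with r_ref."""
        r_mtj = self.r_ap if self.state[row][col] == "AP" else self.r_p
        return 0 if r_mtj > self.r_ref else 1


if __name__ == "__main__":
    array = MramArray(rows=2, cols=3)
    array.write(1, 2, 0)
    print([[array.read(r, c) for c in range(3)] for r in range(2)])
```

Only the cell selected by both its row (WL) and column (BL/SL) changes state, which is
the property the shared lines of the array are meant to provide.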

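[Figure 1.4: (a) 1T1J bit-cell (MTJ in series with an access transistor) between bit line and source line, with word line, bipolar write pulse/read bias generator, sense amplifier and reference; (b) cell array with word lines (WL) along the rows and SL/BL pairs along the columns]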

Fig. 1.4 1T1J MRAM [20]: a bit-cell structure and periphery. b Cell array

[Figure 1.5: five panels comparing SRAM and STT-MRAM caches versus capacity (128K to 16M): area (mm2), read latency (ns), write latency (ns), read energy (nJ) and write energy (nJ)]

Fig. 1.5 Typical architecture-level results of performance comparison between the SRAM and
STT-MRAM caches

In pioneering studies, researchers investigated the benefits of replacing the SRAM
with the STT-MRAM in the cache [21–23]. Typical results are shown in
Fig. 1.5. The comparison between STT-MRAM and SRAM can be summarized as
follows.
(i) Compared with the SRAM, the STT-MRAM offers higher density and
thereby larger capacity.
(ii) The leakage power of the STT-MRAM cache is lower than that of the SRAM,
as the STT-MRAM cell stores the data in a nonvolatile way and most of the
leakage power comes from the CMOS peripheral circuits.
(iii) However, the dynamic power (especially write power) of the STT-MRAM is
much higher than that of the SRAM.
(iv) Assuming that the total power of a cache is composed of leakage power and
dynamic power, the STT-MRAM cache consumes less power than the
SRAM if the reduction in leakage power exceeds the increase in dynamic
power.
(v) In addition, the write latency of the STT-MRAM is much larger than that of the
SRAM, which may decrease the throughput and IPC (instructions per cycle).
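Point (iv) amounts to a simple break-even condition, which the following Python sketch
works out. It assumes a deliberately simplified model (total power equals leakage power
plus access rate times energy per access); the function names and all numbers are
hypothetical and are not taken from Fig. 1.5.

```python
def total_power(leak_w, e_read_j, e_write_j, reads_per_s, writes_per_s):
    """Total cache power (W) = leakage + dynamic (read and write) power."""
    return leak_w + reads_per_s * e_read_j + writes_per_s * e_write_j


def break_even_write_rate(sram, mram, reads_per_s):
    """Write rate (1/s) below which the STT-MRAM cache consumes less power.

    sram / mram: dicts with keys 'leak_w', 'e_read_j', 'e_write_j'.
    """
    leak_saving = sram["leak_w"] - mram["leak_w"]
    read_penalty = reads_per_s * (mram["e_read_j"] - sram["e_read_j"])
    write_penalty = mram["e_write_j"] - sram["e_write_j"]
    if write_penalty <= 0:
        return float("inf")  # MRAM writes are not costlier: no upper limit
    return max(0.0, (leak_saving - read_penalty) / write_penalty)


if __name__ == "__main__":
    # Hypothetical caches: the MRAM saves 0.9 W of leakage, costs 1.5 nJ more per write.
    sram = {"leak_w": 1.0, "e_read_j": 0.5e-9, "e_write_j": 0.5e-9}
    mram = {"leak_w": 0.1, "e_read_j": 0.5e-9, "e_write_j": 2.0e-9}
    print("break-even writes per second:", break_even_write_rate(sram, mram, 1e8))
```

Workloads that write less often than the break-even rate favor the STT-MRAM cache in this
model, which is in line with the read-intensive use case discussed next.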
Considering the above points, it is not recommended to use the STT-MRAM as
L1 cache, where access operations are frequent. By contrast, STT-MRAM
may be a preferred candidate for L2 cache or LLC, where large capacity and low
standby power are the main concerns. In particular, sub-5 ns write latency has
recently been experimentally demonstrated with the perpendicular STT-MRAM [9, 24–27],
which is sufficient for the LLC.
From the viewpoint of application, a number of related works demonstrated that
the STT-MRAM cache is more suitable for read-intensive applications [21–23].
Moreover, the cache miss rate can be reduced thanks to the large capacity of
the STT-MRAM. However, the performance of the STT-MRAM cache degrades
as write operations become more frequent, because the write latency and write power
of the STT-MRAM are worse than those of the SRAM, in agreement with the above
analysis. These results demonstrate that proper design strategies are indispensable
for the optimization of the STT-MRAM cache; they are presented in the following
subsections.

1.3.1 Architecture-Level Optimization

In the STT-MRAM cache design, architecture-level optimization is a very active
research area. Most researchers focus on reducing the write cost and enhancing the
read/write reliability.
(i) Reduce the write cost
First, a technique called Early Write Termination (EWT) has been proposed [28]. It was
demonstrated that about 88% of written bits are redundant in a 16 MB STT-MRAM
L2 cache, which wastes a large amount of power. Therefore, the work presented in
[28] investigated a simple ‘read before write’ scheme. To write data into a cache
line, the stored data is first read and compared with the incoming data. Bits that
are identical to the incoming bits do not need to be modified. Afterwards, a
module is activated to monitor all the write processes of the L2 cache line, and a
corresponding weight coefficient is taken to represent the process. During the write
operation, the system compares the weight with a threshold. If the weight is smaller
than the threshold, the system concludes that this write operation can be suspended. Then
the write operation of the corresponding redundant bits is terminated to save the
write energy. Experiments show that the EWT reduces the write power by 32% in a
16 MB STT-MRAM L2 cache.
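The ‘read before write’ comparison at the heart of this scheme can be sketched with a
bitwise XOR. The Python example below covers only that comparison step, not the
monitoring module or the weight threshold of [28]; the function names and the 8-bit
example words are hypothetical.

```python
def bits_to_switch(stored: int, incoming: int, width: int) -> int:
    """Mask of bit positions whose stored value differs from the incoming one."""
    mask = (1 << width) - 1
    return (stored ^ incoming) & mask


def redundant_fraction(stored: int, incoming: int, width: int) -> float:
    """Fraction of bits in the word that do not need to be rewritten."""
    changed = bin(bits_to_switch(stored, incoming, width)).count("1")
    return (width - changed) / width


if __name__ == "__main__":
    stored, incoming = 0b10110010, 0b10100110
    print(format(bits_to_switch(stored, incoming, 8), "08b"))
    print(redundant_fraction(stored, incoming, 8))
```

Only the positions set in the returned mask would receive a write current; all other
cells already hold the requested value.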
Second, the work presented in [29] deals with the so-called ‘write block’ issue in
the STT-MRAM cache. In the cache framework, the read and write operations share
the same I/O port. As a result, the relatively large write latency of the STT-MRAM
blocks the read requests, which should be executed with a higher priority. In
[29], an Obstruction-Aware Policy (OAP) was proposed to monitor the miss rate of the
L2 cache. In the L2 cache, a cache miss induces a data fetch from the off-chip main
memory, which leads to a large delay of several hundred cycles. While the system is
running, the OAP checks the miss rate periodically. If the miss rate is larger than
a threshold, write accesses bypass this cache level: instead of waiting in the
access queue, the data is written directly to the main memory. In this way, the
write operation of the STT-MRAM does not occupy the I/O port. As a result,
the read performance can be improved by 14% for an 8 MB shared STT-MRAM L2
cache.
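The periodic check and the bypass decision can be summarized in a few lines. The Python
sketch below is a simplified reading of the policy described above, not the
implementation of [29]; the class name, the default threshold and the sampling period
are assumptions.

```python
class ObstructionAwarePolicy:
    """Toggle write bypass of the L2 cache based on a periodically sampled miss rate."""

    def __init__(self, miss_threshold=0.5, period=1000):
        self.miss_threshold = miss_threshold  # hypothetical threshold
        self.period = period                  # accesses per monitoring window
        self.accesses = 0
        self.misses = 0
        self.bypass = False

    def record_access(self, hit: bool) -> None:
        """Count one L2 access; re-evaluate the bypass flag at the end of a window."""
        self.accesses += 1
        if not hit:
            self.misses += 1
        if self.accesses == self.period:
            self.bypass = (self.misses / self.accesses) > self.miss_threshold
            self.accesses = 0
            self.misses = 0

    def route_write(self) -> str:
        """Destination of the next write request."""
        return "main_memory" if self.bypass else "l2_cache"


if __name__ == "__main__":
    policy = ObstructionAwarePolicy(miss_threshold=0.5, period=4)
    for hit in (False, False, False, True):
        policy.record_access(hit)
    print(policy.route_write())
```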
Third, it has been experimentally demonstrated that the critical STT current for
P-to-AP switching is intrinsically larger than that for AP-to-P switching [30]. Such an
asymmetry degrades the write performance of the STT-MRAM. To solve this issue,
an asymmetrical compensation technique was proposed [31]. If the number of ‘0’s
(AP state) in a data unit is larger than the number of ‘1’s (P state), the data unit is
flipped so that AP-to-P switching occurs more frequently than P-to-AP switching. The
flipped data is recovered at the output port. Furthermore, a uniform refresh mechanism
was proposed in [32] to alleviate the write asymmetry (see Fig. 1.6). Initially, all the
bits in a cache line are set to AP states (‘L’ in Fig. 1.6). Then, the incoming data is
preferentially written into this refreshed cache line. Thereby only AP-to-P switching
occurs during the write operation. With this idea, the system achieves a 35% power
reduction.

Fig. 1.6 Uniform refresh mechanism for alleviating the write asymmetry of the
STT-MTJ [32]

[Figure 1.7: (a) eight cores (Core0 to Core7), each with instruction and data caches (IC/DC), connected through memory controllers to MRAM and SRAM banks; (b) flowchart: accessed data that is write-intensive goes to the SRAM, otherwise to the MRAM]

Fig. 1.7 a Schematic of a typical hybrid SRAM-MRAM cache architecture [34]. b In a hybrid
SRAM-MRAM cache, the write-intensive data is handled by the SRAM while the other data by the
MRAM
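The data-flipping compensation of [31] described above can be illustrated as follows.
This Python sketch assumes that one extra flag per data unit records whether the unit was
flipped (the text only states that the data is recovered at the output port); the
function names are hypothetical.

```python
def flip_encode(bits):
    """Return (stored_bits, flipped); flip when '0's (AP) outnumber '1's (P)."""
    zeros = bits.count(0)
    ones = len(bits) - zeros
    if zeros > ones:
        return [1 - b for b in bits], True
    return list(bits), False


def flip_decode(stored_bits, flipped):
    """Recover the original data unit at the output port."""
    return [1 - b for b in stored_bits] if flipped else list(stored_bits)


if __name__ == "__main__":
    stored, flag = flip_encode([0, 0, 0, 1])
    print(stored, flag, flip_decode(stored, flag))
```

After encoding, a stored data unit never contains more ‘0’s than ‘1’s, so the costly
P-to-AP transitions are kept in the minority.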
The above techniques aim to optimize a full STT-MRAM cache, where the SRAM
is absent. By contrast, some researchers acknowledge the fact that the write
performance of the STT-MRAM is not as good as that of the SRAM. They propose a new
architecture called hybrid SRAM-MRAM cache (see Fig. 1.7), where the SRAM
handles the frequently accessed ‘hot data’ and the STT-MRAM stores the other data
in a nonvolatile way. Proper management policies are required to exploit the advantages
of the hybrid SRAM-MRAM cache [22, 33–36]. For example, it was suggested in [22] that
write-intensive workloads be executed in the SRAM instead of the MRAM to avoid the
large write cost. The data transfer between SRAM and MRAM causes migration
overhead, so a loop retiming technique was proposed to reduce the migration
[35].
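A minimal placement rule for such a hybrid cache might look like the Python sketch below.
It is not one of the policies of [22, 33–36]: the write-ratio threshold, the minimum
number of observed accesses and the choice of the MRAM as the initial bank are all
assumptions made for illustration.

```python
from collections import defaultdict


class HybridCachePlacement:
    """Place write-intensive blocks in the SRAM bank and the rest in the MRAM bank."""

    def __init__(self, write_ratio_threshold=0.5, min_accesses=4):
        self.threshold = write_ratio_threshold  # hypothetical threshold
        self.min_accesses = min_accesses        # accesses observed before deciding
        self.reads = defaultdict(int)
        self.writes = defaultdict(int)
        self.bank = {}                          # block address -> "SRAM" or "MRAM"
        self.migrations = 0

    def access(self, addr, is_write):
        """Record one access and return the bank that should hold the block."""
        if is_write:
            self.writes[addr] += 1
        else:
            self.reads[addr] += 1
        total = self.reads[addr] + self.writes[addr]
        target = self.bank.get(addr, "MRAM")    # new blocks start in MRAM (assumed)
        if total >= self.min_accesses:
            ratio = self.writes[addr] / total
            target = "SRAM" if ratio > self.threshold else "MRAM"
        if addr in self.bank and self.bank[addr] != target:
            self.migrations += 1                # each bank change costs a migration
        self.bank[addr] = target
        return target


if __name__ == "__main__":
    cache = HybridCachePlacement()
    print([cache.access(0x10, is_write=True) for _ in range(4)], cache.migrations)
```

The migration counter makes the overhead mentioned above visible: every change of bank
implies a data transfer between the SRAM and the MRAM.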
The temperature effect of the STT-MRAM has also attracted much attention. The work
presented in [37] investigated the thermal properties of a 1T1J STT-MRAM cell.
Figure 1.8 indicates that both the write energy and latency decrease rapidly as the
temperature increases, while the read ones vary only slightly. In addition, in [38] the
peak temperature of an Intel Haswell architecture CPU is evaluated. As shown in
Fig. 1.9a, the thermal gradient can exceed 30 °C. Combining these points with the
unique temperature properties of the STT-MRAM cell, the authors of [38] proposed a
thermal-aware cache data replacement policy called ‘Thermosiphon’.
In the ‘Thermosiphon’ policy, two counters are introduced, as shown in
Fig. 1.9b. With different count rules, the access counter and ratio counter provide a
basis for comparing the read and write access weights. All the cache sets in the
LLC are then split into different regions based on the thermal distribution (see Fig. 1.9c).

[Figure 1.8: four panels of waveforms versus time (ns) at −25, 25, 85 and 125 °C: (a), (c) read voltages (Vdata0 or Vdata1 and Vref, in V) and currents (µA); (b), (d) write pulse, write current (µA) and the transition between the AP and P states]

Fig. 1.8 Effect of the temperature on the read/write performance. a and c Read latencies for ‘0’ and
‘1’, respectively, at different temperatures. b and d Write current for AP and P states, respectively,
at different temperatures [37]

[Figure 1.9: (a) thermal map (310 K to 332 K) of a chip with eight cores, LLC, and queue/uncore/I/O blocks; (b) cache block with a data access counter and a ratio counter, incremented (+1) on write and read accesses; (c) cache banks 1 to 8 split into a hot region and a cool region by a boundary bank]

Fig. 1.9 a Thermal map of eight-core Intel Haswell architecture. b, c Thermosiphon policy: the
data can be adaptively distributed to different temperature regions [38]

As discussed above, the STT-MRAM cell offers lower write power consumption and
shorter write delay at higher temperature. Therefore, across the different
temperature regions, the system can adaptively migrate write-intensive data
according to a comparison of the two counter values of each cache line
within the same cache set. In this way, unlike the widely adopted Least Recently
Used (LRU) data replacement policy, ‘Thermosiphon’ migrates more write-intensive data
into the profitable (hot) region of a spatial cache set with the least
sacrifice of read performance. Compared to the conventional LRU policy which is
adopted in nonuniform cache architecture (NUCA) [39], the ‘Thermosiphon’ policy can
save 22.5% of the write energy with negligible hardware overhead.
(ii) Enhance the read/write reliability
For the STT-MRAM, the cell reliability is threatened by retention failure, read
disturbance and write failure [40]. Retention failure occurs when a cell is idle (i.e., it
is not being read or written) and its state flips stochastically due to thermal
activation. Read disturbance means that the data is unexpectedly changed by the
read current. A write failure occurs when the accessed MTJs are not switched by the
applied current. Many circuit-level efforts towards high-reliability STT-MRAM
have been made in the past decade [41–43]. Besides, ECC (error correction
coding) is widely accepted as an efficient method of enhancing the reliability at the
architecture level [44].
The mainstream ECC schemes adopted for cache protection are SEC-DED
(single-error correction, double-error detection) and DEC-TED (double-error
correction, triple-error detection). The SEC-DED technique relies on XOR (parity)
operations over the data bits, so that each data bit is covered by multiple redundant
check bits. Therefore, it can detect one-bit and two-bit errors, and correct one-bit
errors. The DEC-TED technique provides stronger protection with double-error
correction and triple-error detection ability, but at the expense of longer latency and
larger hardware overhead. Some researchers try to modify the typical ECC schemes to
address the abovementioned three reliability issues of the STT-MRAM. For example,
the authors of [45] proposed the ‘Sliding Basket’ policy, which classifies the cache sets
into two groups and applies different ECC strengths (high and low) to them. When a
write request arrives, the system determines the reliability class of the data by
calculating the FLIP possibility and comparing the result with a customer-specific
threshold. Then, the system adopts different policies to deal with the High Flip Data
(HFD) and Low Flip Data (LFD) in the regions of different ECC strength (see
Fig. 1.10). The simulation results demonstrated that, compared to STT-RAM caches
with a conventional ECC scheme, applying ‘Sliding Basket’ can achieve up to 80.2%
saving in ECC bit overhead, comparable write reliability and a significant performance
improvement.
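To show how XOR-based check bits yield SEC-DED behavior, the Python sketch below
implements a textbook extended Hamming (8,4) code: three Hamming parity bits plus one
overall parity bit protect four data bits. This small code is chosen for illustration
only; it is not the ECC of [44] or [45], and real caches use much wider code words.

```python
from functools import reduce
from operator import xor

DATA_POS = (3, 5, 6, 7)   # positions of the data bits (index 0 = overall parity)
PARITY_POS = (1, 2, 4)    # positions of the Hamming parity bits


def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word."""
    word = [0] * 8
    for pos, bit in zip(DATA_POS, nibble):
        word[pos] = bit
    for p in PARITY_POS:
        # Each parity bit is the XOR of the positions whose index contains bit p.
        word[p] = reduce(xor, (word[i] for i in range(1, 8) if i & p and i != p), 0)
    word[0] = reduce(xor, word[1:], 0)  # overall parity enables double-error detection
    return word


def secded_decode(word):
    """Return (data_bits, status); status is 'ok', 'corrected' or 'double_error'."""
    word = list(word)
    syndrome = reduce(xor, (i for i in range(1, 8) if word[i]), 0)
    overall = reduce(xor, word, 0)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[syndrome] ^= 1             # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        status = "double_error"         # detected but not correctable
    return [word[i] for i in DATA_POS], status


if __name__ == "__main__":
    code = secded_encode([1, 0, 1, 1])
    code[5] ^= 1                        # inject a single-bit error
    print(secded_decode(code))
```

A DEC-TED code extends the same idea with more check bits and a more complex decoder,
which is where the extra latency and hardware overhead mentioned above come from.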
(iii) Tools for the NV-Cache evaluation
Simulators play an essential role in evaluating the architecture-level performance
of an emerging NV-Cache, especially before this NV-Cache can be taped out. The
NVSim [46] simulator can be regarded as a pioneering effort. Based on the well-
developed memory simulator CACTI [47], NVSim provides a hardware simulation
platform for the high-performance cache and main memory. Most emerging NVMs,
including STT-MRAM, PCRAM, RRAM, and NAND-Flash, can be evaluated with
NVSim.
The accuracy of the NVSim model can be further improved for a specific NVM
technology. Some works have focused on modifying the NVSim simulator to
make it more suitable for STT-MRAM simulation [48, 49]. In [48] the authors
proposed an architecture-level cache simulation platform which integrates a
state-of-the-art 40 nm perpendicular STT-MRAM technology. The simulation results show
that, compared with NVSim, the proposed framework is more accurate for STT-MRAM
cache simulation.

[Figure 1.10: diagram with strong and weak ECC regions showing three cases: (a) LFD data (FLIP below the threshold), written directly to the hit block A; (b) LFD data, written directly to the weak region with eviction of the LRU block A'; (c) HFD data (FLIP above the threshold), error check and exchange with the LRU block in the strong region. Block A: hit block; block A': LRU block that satisfies the ECC requirement]

Fig. 1.10 ‘Sliding Basket’ policy. Data are adaptively handled depending on the FLIP possibility
[45]

1.3.2 Cell-Level Optimization

The power gating (PG) technique can be used to lower the power consumption
[50]. As shown in Fig. 1.11, PG means that the power supply is cut off when no
application has been running for a long time, resulting in zero leakage power. Before
triggering PG in an SRAM, the data have to be moved to the lower-level cache to
avoid losing them. Once the application is restarted, the backed-up data need to be
recovered from the lower-level cache, causing a large energy overhead. This problem can
be solved by using the STT-MRAM, since PG does not cause data loss thanks
to the nonvolatility. Nevertheless, as mentioned above, the write speed/energy of the
STT-MRAM is poorer than that of the SRAM. Thus, the zero-leakage merit of the
STT-MRAM brings little benefit if applications run frequently.

[Figure 1.11: power versus time over two running periods separated by a long standby time, showing active and static power; annotations mark zero standby power by power gating and zero standby power without power gating]

Fig. 1.11 a Power gating (PG) technique for the SRAM. The power supply is cut off to reduce the
leakage power if there is no running application. b For the MRAM, the leakage power is nearly
zero even if the PG technique is not applied [50]

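[Figure 1.12: 6T2J cell schematic with bit lines (BL), word line (WL) and two MTJs (labeled F/P and P/F); the leaking path without PG is marked]

Fig. 1.12 Schematic bit-cell structure of the 6T2J NV-SRAM [50]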

To combine the high speed of the SRAM with the zero leakage power of the
MRAM, hybrid NV-SRAM cells were designed with STT-MTJs [50–52].
Figure 1.12 shows a typical 6T2J NV-SRAM bit-cell [50], which can work in two
different modes. In the normal mode, this NV-SRAM operates as an SRAM. In the PG
mode, when the power supply is cut off, the data is stored in a pair of complementary
MTJs and thus the NV-SRAM operates as an MRAM. The power consumption
can be significantly reduced by making the PG time as long as possible. Fast
write operation is guaranteed since the data is written into the SRAM in the normal
mode, whereas the MTJs are only responsible for the data backup. Nevertheless, the
static power of the NV-SRAM cannot be totally eliminated as the leakage path still
exists in the normal mode (see Fig. 1.12). In addition, the NV-SRAM incurs an area
penalty due to the additional peripheral circuits required by the PG technique.

[Figure 1.13: (a) 2T2J cell storing one bit as complementary AP/P states, read at V_read = 0.4 V through a bit line pair into a sense amplifier (SA); (b) 3T2J cells with lines blt, sl, blc, wwl, rwl and nodes nw, nrt, nrc]

Fig. 1.13 a 2T2J cell enables the differential sensing and enlarges the sensing margin thanks to
the complementary design [53]. b 3T2J cell enables the simultaneous write operation [55]

To improve the read performance of the STT-MRAM cache, a 2T2J bit-cell
structure was proposed, as shown in Fig. 1.13a [9, 53, 54]. Complementary data is
stored in two 1T1J cells and read by a current-integral sensing scheme
with differential amplification. Compared with the conventional 1T1J bit-cell, the 2T2J
counterpart doubles the sensing margin and reduces the read latency. Based on this
design, a 1-Mb STT-MRAM test chip was fabricated. Evaluation results validate the
energy efficiency of this 2T2J STT-MRAM cache. An improved solution
adopting a similar idea is the 3T2J bit-cell shown in Fig. 1.13b [55], where an extra
transistor connects the complementary MTJs. This design enables the simultaneous
write operation and thereby decreases the cycle time and write power consumption.
An adaptive 3T3J cell structure, shown in Fig. 1.14a, was also proposed [56]. One
3T3J cell can store 2-bit data via the resistance combinations of three MTJs. In this
structure, the left part uses a 1T1J cell to store 1-bit data (Bit0), and the right part is
a 2T2J-like structure (Bit1). Two-stage sensing is adopted to read the 3T3J cell. As
can be seen in Fig. 1.14b, the 1T1J cell is sensed during the first stage and the
2T2J cell is read during the second stage. The 2T2J cell can be sensed
faster than the 1T1J cell. Thus, with this 3T3J cell structure, the total read latency for
2-bit data is much smaller than that of the standard 1T1J-based cache. Moreover, this
3T3J cell reduces the area overhead compared with the standard 2T2J cell. At runtime,
the 3T3J cell can work in 3T and 2T modes dynamically. If the running
application is space-hungry, the 3T mode is used to achieve the performance
and area benefits. Alternatively, a performance-demanding application activates
the 2T mode, in which the performance is comparable to that of the 2T2J cell.

1.3.3 Device-Level Optimization

Nowadays it is widely accepted that the perpendicular MTJ outperforms the in-plane
MTJ for the reasons mentioned in Sect. 1.2. Advances in nanotechnology
make it possible to develop a perpendicular MTJ qualified for higher-level cache
(e.g. L2 or L1 cache). For that purpose, the main challenge is achieving sub-5 ns
write latency with an affordable current. Recently, sub-ns STT switching speed has
been experimentally demonstrated in an 80-nm perpendicular MTJ [26]. In addition,
a sufficient TMR ratio should be guaranteed for fast read operation. A double-interface
perpendicular MTJ with a TMR ratio as high as 249% has been recently
developed. Such a high TMR ratio was obtained by using an atom-thick tungsten
spacer to enhance the spin filtering [57]. Very recently, a perpendicular MTJ showed
competitive features for the NV-cache [27], such as sub-20 nm size, sub-3 ns
write latency, 150% TMR ratio, <100 µA write current, and sufficient read margin.

[Figure 1.14: (a) 3T-3MTJ cell (MTJ0, MTJ1, MTJ2 sharing the WL) connected to bit lines BL0, BL1 and BL2, with sense amplifiers (SA) for Bit0 and Bit1, clamp voltage and upload network; (b) voltage (V) versus time (ns) waveforms of SA1E, SA2E, BL0, BL1 and BL2, with a first-stage read latency of 2.382 ns and a second-stage read latency of 0.395 ns]

Fig. 1.14 a 3T3J bit-cell and corresponding periphery. b Two-stage sensing waveforms for the
3T3J bit-cell [56]

Under the macrospin assumption, the critical STT current (Ic) and the data-retention
time (τ) can be calculated by (1.4)–(1.5). Here, we take the perpendicular
MTJ as an example. A longer retention time indicates that the data is more immune to
random bit-flips, but it also means a larger write current and thereby higher power
consumption. For the storage application, the data-retention time must be more than
10 years over a wide range of temperatures (e.g. −10 to 70 °C for consumer
applications). This requires a thermal stability barrier of about 60–80 k_B T, which,
however, is unnecessarily large for the cache application. In practice, the data is
maintained in the cache for a much shorter time than in storage devices, since the data
in a cache are frequently updated (e.g. 1 s retention is sufficient for L3 cache [58]).
Therefore, the relaxation of nonvolatility was proposed for the STT-MRAM cache to
decrease the write current and write latency [59, 60].

$$I_c = \frac{4 \alpha e}{\hbar g} \Delta \qquad (1.4)$$

$$\tau = \tau_0 \exp\left[ \frac{\Delta}{k_B T} \left( 1 - \frac{I}{I_c} \right) \right] \qquad (1.5)$$

where Δ is the thermal stability barrier of (1.2), ħ is the reduced Planck constant,
τ0 ≈ 1 ns is the attempt time, k_B is the Boltzmann constant, T is the temperature, and
I is the applied current.
Equations (1.4)–(1.5) show that the write current can be reduced at the expense
of the retention time. Generally the thermal stability barrier is proportional to the
free layer volume, thus reducing the MTJ planar area decreases both the retention time
and the write current. In some aggressive designs, the retention time of the MTJ
is decreased to several milliseconds or microseconds. In this case, the nonvolatility
of the MRAM is almost discarded, and a refresh policy is required
to protect the data against random bit-flips. Although a shorter retention time can
significantly decrease the write current and write power, it requires more frequent
refresh operations, which cost extra latency and energy. An excessive refresh rate can
overshadow the benefits of the reduced retention time. Therefore, the tradeoff between
the retention time and refresh interval needs to be carefully taken into account [60–62].
For example, in [62] a volatility monitor is used to determine whether a cache line
has lost the stored data. Compared with uniform refresh, this policy introduces
a dynamic methodology to control the refresh period of each cache line. According
to the experimental results, 80% of the refresh overhead can be saved.
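The trade-off expressed by (1.4)–(1.5) can be put into numbers with the Python sketch
below. It evaluates the single-cell expressions only, with the barrier given in units of
k_B T; the values of α, g and the temperature in the usage part are hypothetical, and
refresh overhead is not modeled.

```python
import math

HBAR = 1.054571817e-34       # reduced Planck constant (J*s)
E_CHARGE = 1.602176634e-19   # electron charge (C)
K_B = 1.380649e-23           # Boltzmann constant (J/K)


def critical_current(delta_kbt, temperature_k, alpha, g):
    """Eq. (1.4): Ic = 4*alpha*e*Delta / (hbar*g), Delta given in units of k_B*T."""
    delta_j = delta_kbt * K_B * temperature_k
    return 4.0 * alpha * E_CHARGE * delta_j / (HBAR * g)


def retention_time(delta_kbt, i_over_ic=0.0, tau0=1e-9):
    """Eq. (1.5): tau = tau0 * exp(Delta/(k_B*T) * (1 - I/Ic)), in seconds."""
    return tau0 * math.exp(delta_kbt * (1.0 - i_over_ic))


def barrier_for_retention(target_s, tau0=1e-9):
    """Invert Eq. (1.5) at I = 0: Delta/(k_B*T) needed for a target retention time."""
    return math.log(target_s / tau0)


if __name__ == "__main__":
    alpha, g, temp = 0.01, 0.5, 300.0   # hypothetical device parameters
    relaxed = barrier_for_retention(1.0)  # 1 s retention, as quoted for L3 cache
    print("barrier for 1 s retention (k_B*T):", relaxed)
    print("Ic at 60 k_B*T (A):", critical_current(60.0, temp, alpha, g))
    print("Ic at relaxed barrier (A):", critical_current(relaxed, temp, alpha, g))
```

Because Ic in (1.4) is proportional to the barrier, relaxing the barrier reduces the
critical current by the same factor, while the retention time drops exponentially.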

1.4 Emerging Spintronics Technologies for the NV-Cache

In this section, we briefly introduce two emerging spintronics technologies which
are expected to bring significant improvements over the STT-MRAM. These
technologies are in their infancy, and there is a long way to go before they can be used
to design a real high-performance cache. Nevertheless, some advanced research
has been done to show their great application potential.

1.4.1 Spin Orbit Torque

Despite massive progress, the STT-MRAM cache has to tolerate the following intrinsic
drawbacks. First, the read and write operations share the same path, and thus it is
difficult to optimize the read and write performance at the same time. Second, the
STT is expressed as (1.6). Since both terms vanish when m is exactly parallel or
antiparallel to m_p, it is the thermal fluctuation that induces a small angle
between m and m_p and thus triggers the STT. Therefore, an incubation delay is
caused and limits the STT switching speed. These drawbacks have to be compensated
through cell-level or architecture-level strategies when designing an STT-MRAM
cache.

$$\boldsymbol{\tau}_{STT} \propto \lambda_{DL} J_{STT}\, \mathbf{m} \times (\mathbf{m} \times \mathbf{m}_p) + \lambda_{FL} J_{STT}\, \mathbf{m} \times \mathbf{m}_p \qquad (1.6)$$

where J_STT is the STT current density, m and m_p are the unit magnetization vectors
of the free layer and pinned layer, respectively, and λ_DL and λ_FL are the coefficients
describing the strengths of the damping-like and field-like torques, respectively.
The above two drawbacks of the STT can be overcome by an emerging mechanism
called spin orbit torque (SOT) [63–66]. The schematic structure of the SOT-MTJ is
shown in Fig. 1.15, where an MTJ is deposited above a heavy-metal strip. A current
passing through the heavy-metal strip induces the SOT, which drives the magnetization
switching of the free layer. The SOT may originate from the spin Hall effect, the
Rashba effect, or both, but the relative contributions of the two effects cannot be
easily determined. The SOT can be expressed as

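$$\boldsymbol{\tau}_{SOT} \propto \lambda_{DL} J_{SOT}\, \mathbf{m} \times (\mathbf{m} \times \boldsymbol{\sigma}) + \lambda_{FL} J_{SOT}\, \mathbf{m} \times \boldsymbol{\sigma} \tag{1.7}$$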

where $J_{SOT}$ is the SOT current density and $\boldsymbol{\sigma}$ is the unit polarization vector of the
SOT-induced spin injection. For the in-plane MTJ, $\mathbf{m}$ is nearly parallel to $\boldsymbol{\sigma}$, and thus the
situation is similar to that of the STT. The incubation delay is still not eliminated.
Nevertheless, the efficiency of the SOT is much higher than that of the STT due to the

Fig. 1.15 Schematic structures of the spin orbit torque (SOT) based devices. Panels (a) and (b)
describe the SOT mechanisms from the viewpoints of the spin Hall effect and the Rashba effect, respectively [66]
1 Towards Spintronics Nonvolatile Caches 19

following reason. For the spin Hall effect, the ratio of the spin-polarized current
to the charge current can be made larger than 1 by adjusting the lateral area of the
heavy metal [see (1.8)], whereas it is smaller than 1 in the case of the STT-MTJ. In
addition, the SOT current can be further increased to improve the switching speed,
since it passes through the heavy-metal strip and poses no risk of barrier breakdown.

$$\frac{I_S}{I_C} = \theta_{SH} \frac{S_{MTJ}}{S_{HM}} \tag{1.8}$$

where $I_S$ and $I_C$ are the spin current and charge current, respectively. $S_{MTJ}$ and $S_{HM}$
are the MTJ planar area and heavy-metal lateral area, respectively. $\theta_{SH}$ is the spin
Hall angle.
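As a rough illustration of (1.8), the sketch below computes the spin-to-charge current ratio for a circular MTJ sitting on a thin heavy-metal strip. The spin Hall angle of 0.3 and the circular 40 nm MTJ are assumptions chosen only for illustration; the 5 × 40 nm² lateral area is the value quoted in the text for the heavy-metal strip. The function name is hypothetical.

```python
import math

def spin_to_charge_ratio(theta_sh, s_mtj, s_hm):
    """Ratio I_S / I_C according to (1.8); both areas must use the same unit."""
    return theta_sh * s_mtj / s_hm

# Assumed geometry (illustrative only): a circular MTJ of 40 nm diameter on a
# heavy-metal strip whose lateral (cross-sectional) area is 5 nm x 40 nm.
s_mtj_nm2 = math.pi * (40.0 / 2.0) ** 2
s_hm_nm2 = 5.0 * 40.0

# Assumed spin Hall angle; real values depend on the heavy-metal material.
theta_sh = 0.3

ratio = spin_to_charge_ratio(theta_sh, s_mtj_nm2, s_hm_nm2)
print(f"S_MTJ = {s_mtj_nm2:.0f} nm^2, S_HM = {s_hm_nm2:.0f} nm^2, I_S/I_C = {ratio:.2f}")
```

Under these assumptions the ratio exceeds 1 even though the spin Hall angle itself is well below 1, because the MTJ planar area is several times larger than the lateral area of the strip.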
For the perpendicular MTJ, $\mathbf{m}$ is nearly perpendicular to $\boldsymbol{\sigma}$, so the SOT is much
stronger and the incubation delay can be eliminated. Recently, ultrafast sub-ns
magnetization switching has been experimentally demonstrated with the perpendicular
SOT-MTJ [67], which is comparable to the write speed of the SRAM. The SOT
write current density is $10^{11}$ to $10^{12}$ A/m$^2$. For a heavy-metal strip with a 5 ×
40 nm$^2$ lateral area, the SOT write current is 20–200 µA, comparable to that of a 40 nm
perpendicular STT-MTJ. Therefore, the perpendicular SOT-MTJ is a competitive
candidate for upper-level caches (e.g. the L1 cache).
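The quoted write-current range follows directly from multiplying the current density by the lateral area of the heavy-metal strip. A minimal check of this arithmetic, with a hypothetical helper name, could look as follows:

```python
def write_current_amps(current_density_a_per_m2, width_m, thickness_m):
    """Current through a strip of the given rectangular lateral area."""
    return current_density_a_per_m2 * width_m * thickness_m

# Strip lateral area from the text: 5 nm x 40 nm.
WIDTH_M = 40e-9
THICKNESS_M = 5e-9

for j in (1e11, 1e12):  # SOT write current densities quoted in the text, in A/m^2
    i_ua = write_current_amps(j, WIDTH_M, THICKNESS_M) * 1e6
    print(f"J = {j:.0e} A/m^2  ->  I = {i_ua:.0f} uA")
```

The two ends of the density range give 20 µA and 200 µA, matching the figures stated above.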
Recently, several simulation studies have been carried out to evaluate the potential
of the SOT-MRAM cache [1, 68, 69]. A typical bit-cell array is shown in Fig. 1.16,
where each SOT-MTJ is connected to two access transistors, one for read and one for
write operations. The sensing amplifier reads the MTJ state by comparing the
read current with a reference value. The write driver generates bidirectional currents
that flow through the heavy-metal strip and switch the magnetization of the free

Fig. 1.16 Typical bit-cell array for the SOT-MRAM design (signal labels in the figure: BL[M], SL[M],
RWL[N], WWL[N], IRead, IWrite)



Fig. 1.17 Schematic structure of the NAND-like spin orbit torque device [70, 71]. The storage
density is improved by sharing the SOT current

layer. System-level simulation results demonstrate that the write performance of the
cache is significantly improved by replacing the STT-MRAM with the SOT-MRAM.
However, the SOT-MRAM also suffers from several shortcomings that need
to be overcome. First, the SOT-MRAM cell requires two access transistors, which incurs
an area penalty compared to the 1T1J STT-MRAM cell. Recently, the NAND-like SOT
device shown in Fig. 1.17 has been proposed to alleviate this issue by sharing the
SOT current [70, 71]. Second, the heavy-metal strip causes a loss of TMR ratio
since the read current has to pass through it. Thus a higher TMR ratio is needed for
high-speed and high-reliability read operation. Third, for the perpendicular SOT-MRAM,
an additional magnetic field is required for deterministic switching, which
hinders the practical realization of the SOT-MRAM cache. A switching mechanism
called spin-Hall-assisted STT (SHA-STT) has been proposed for field-free write
operation of the perpendicular SOT-MRAM [72, 73]. The SOT is used to eliminate
the incubation delay while the STT determines the polarity of the write operation.
Simulation results shown in Fig. 1.18 validate the ultrafast deterministic switching
induced by the SHA-STT. System-level evaluation was also performed to validate
the performance improvement of the SHA-STT-MRAM cache [74], as shown in
Fig. 1.19. Recently, the interplay between the SOT and STT has been experimentally
demonstrated in a three-terminal MTJ [75], which validates the feasibility of
the SHA-STT. Based on these experimental results, a novel memory called toggle
spin torques (TST) MRAM has also been proposed for upper-level caches [76].
Besides, recent experiments demonstrate that field-free SOT-induced magnetization
switching can be achieved with ferromagnet/antiferromagnet bilayers, where
an exchange bias replaces the required magnetic field [77, 78]. This
observation offers another route to an ultrafast SOT-MRAM cache.

1.4.2 Voltage-Controlled Magnetic Anisotropy

Besides the SOT, voltage-controlled magnetic anisotropy (VCMA) is another emerging
mechanism for low-power and ultrafast magnetization switching [79–83]. The
VCMA-induced magnetization switching is explained by Fig. 1.20. A voltage applied
to the MTJ causes electron accumulation at the CoFeB/MgO interface through the
electric field. As a result, the occupation of atomic orbitals is changed, which pro-

Fig. 1.18 Typical trajectories of the magnetization (plotted on x, y and z axes, each spanning −1 to 1)
driven by the (a) STT and (b) SHA-STT. Here the magnetization is switched from the –z to the +z
direction. It is clearly seen that the incubation delay can be eliminated by the SHA-STT

Fig. 1.19 Architecture-level results of performance comparison among the SRAM, STT-MRAM
and SHA-STT-MRAM caches [74]. Panels (a)–(f) show area, read latency, read energy, write latency,
write energy and leakage power for cache capacities from 16K to 16M. Here the results are
normalized to the SRAM cache

motes or represses the interfacial PMA depending on the polarity of the applied
voltage. This mechanism can be modeled by (1.9)–(1.10). From the viewpoint of
the energy barrier, a positive voltage lowers the energy barrier for magnetization
switching, whereas a negative voltage raises it. Two regimes can be identified
depending on the amplitude of the applied voltage: (i) if the positive voltage is
sufficiently large to fully eliminate the energy barrier, the magnetization of the free
layer becomes unstable and precesses back and forth between the upward and
downward directions, which is named the precessional regime; (ii) otherwise, the energy
barrier is not fully eliminated, and thermal activation, a magnetic field or the STT is
required to switch the magnetization of the free layer, which is named the
thermal-activation regime. Unlike

Fig. 1.20 (a) Schematic of a VCMA-MTJ device (pinned layer, tunnel barrier and free layer under a bias
voltage Vb); (b) illustration of the impacts of various bias voltages (Vb < 0, Vb = 0, Vb < Vc, Vb = Vc
and Vb > Vc) on the energy barrier between the "P" and "AP" states of a VCMA-MTJ device [83]

the STT or SOT, VCMA induces the magnetization switching through the voltage
instead of current. Therefore the write power can be significantly decreased.

$$K_{eff}(V_b) = \frac{\mu_0 M_s H_{eff}(V_b)}{2} = \frac{K_i(0) - K_i(V_b)}{t_F} - \frac{\mu_0 M_s^2}{2} \tag{1.9}$$

$$K_i(V_b) = \xi \frac{V_b}{t_{ox}} \tag{1.10}$$

where $V_b$ is the applied voltage, $H_{eff}(V_b)$ is the voltage-dependent effective magnetic
field, and $K_i(V_b)$ and $K_i(0)$ are the interfacial PMA energies under $V_b$ and zero voltage,
respectively. $t_F$ and $t_{ox}$ are the thicknesses of the free layer and oxide barrier,
respectively. $\xi$ is a linear VCMA coefficient.
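To make (1.9)–(1.10) concrete, the sketch below evaluates the effective anisotropy for a few bias voltages and solves for the voltage at which it reaches zero, i.e. the boundary between the two regimes described above. All device parameters are assumed, illustrative values rather than data from the chapter, and the function names are hypothetical.

```python
import math

MU0 = 4 * math.pi * 1e-7  # vacuum permeability (T*m/A)

def k_i_voltage(v_b, xi, t_ox):
    """Voltage-dependent interfacial term of (1.10), in J/m^2."""
    return xi * v_b / t_ox

def k_eff(v_b, k_i0, m_s, t_f, t_ox, xi):
    """Effective anisotropy energy density of (1.9), in J/m^3."""
    return (k_i0 - k_i_voltage(v_b, xi, t_ox)) / t_f - MU0 * m_s ** 2 / 2

def critical_voltage(k_i0, m_s, t_f, t_ox, xi):
    """Bias voltage at which k_eff is zero, obtained by solving (1.9) for V_b."""
    return (k_i0 - MU0 * m_s ** 2 * t_f / 2) * t_ox / xi

# Assumed (illustrative) device parameters, all in SI units.
params = dict(
    k_i0=0.32e-3,   # interfacial PMA energy at zero bias (J/m^2)
    m_s=0.625e6,    # saturation magnetization M_s (A/m)
    t_f=1.1e-9,     # free-layer thickness (m)
    t_ox=1.4e-9,    # oxide-barrier thickness (m)
    xi=60e-15,      # linear VCMA coefficient (J/(V*m))
)

v_c = critical_voltage(**params)
print(f"critical voltage V_c = {v_c:.3f} V")
for v_b in (-0.5, 0.0, 0.5, v_c):
    print(f"V_b = {v_b:6.3f} V  ->  K_eff = {k_eff(v_b, **params):10.1f} J/m^3")
```

With these assumed values, a negative bias raises the effective anisotropy, a positive bias lowers it, and the anisotropy vanishes at a bias on the order of one volt.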
Now we consider the circuit design issues for the VCMA-based MRAM. In the
precessional regime, the perpendicular component of the magnetization ($m_z$) periodically
oscillates between the +z and –z directions under the action of a sufficiently large voltage
(see Fig. 1.21a). To switch $m_z$, the duration of the applied pulse needs to be accurately
controlled. Although this type of magnetization switching is ultrafast owing to the
high-frequency precession, accurate control of the pulse duration is rather difficult in
practice, and an additional write-verify operation is required to avoid write errors. In
the thermal-activation regime, the applied voltage merely disturbs the magnetization,
and an additional magnetic field or current is required for deterministic switching.
A feasible design is to first apply a voltage pulse that induces the VCMA (the STT plays
a non-dominant role in this case), followed by a second pulse that induces the STT
(the VCMA plays a non-dominant role in this case). This scheme is called STT-assisted
VCMA [84] (see Fig. 1.21b).
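The sensitivity of precessional switching to the pulse duration can be illustrated with a deliberately simplified toy model, which is not the model used in the chapter. It assumes an undamped precession in which $m_z$ follows a cosine during the pulse and relaxes to the nearest easy-axis direction once the pulse ends. The precession period of 0.6 ns and the function names are hypothetical.

```python
import math

def mz_at_pulse_end(pulse_ns, period_ns):
    """Toy model: m_z starts at +1 and follows an undamped cosine during the pulse."""
    return math.cos(2 * math.pi * pulse_ns / period_ns)

def is_switched(pulse_ns, period_ns):
    """After the pulse, m_z is assumed to relax to the nearest easy-axis direction,
    so the bit is switched only if m_z is negative when the pulse ends."""
    return mz_at_pulse_end(pulse_ns, period_ns) < 0.0

PERIOD_NS = 0.6  # hypothetical precession period, for illustration only

for pulse_ns in (0.15, 0.3, 0.45, 0.6, 0.9, 1.2):
    state = "switched" if is_switched(pulse_ns, PERIOD_NS) else "not switched"
    print(f"pulse = {pulse_ns:.2f} ns  m_z at end = "
          f"{mz_at_pulse_end(pulse_ns, PERIOD_NS):+.2f}  ->  {state}")
```

In this toy picture, pulses near odd multiples of half the period switch the bit, while pulses near full periods return it to its initial state, which is why a write-verify step is needed when the pulse width cannot be controlled precisely.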
Similar to the STT-MTJ, the 1T1J cell structure can be used to construct the
VCMA-MRAM cache. Architecture-level evaluation results [83] demonstrate that
the write energy of the VCMA-MRAM is much lower than that of the other memory types. Moreover,

Fig. 1.21 Time-resolved evolutions of the magnetization of the free layer ($m_z$ versus time, 0 to 4 ns,
for several pulse durations $t_b$) in the presence of the VCMA effect [83]. The VCMA-MTJ switching
operates in the (a) precessional regime; and (b) thermal-activation regime with STT assistance

the write latency of the VCMA-MRAM is competitive with that of the SRAM. Like
the other types of MRAM, the VCMA-MRAM also retains the advantages in area,
read energy/latency, and leakage energy over the SRAM.

1.5 Summaries and Perspectives

In this chapter, we reviewed the efforts towards the MRAM-based NV-Cache. Among
the various types of MRAM, the STT-MRAM is attracting much more research
interest than the others. Both commercial products and experimental prototypes of the
STT-MRAM have been demonstrated. Meanwhile, both standalone and embedded
applications of the STT-MRAM have been explored. These advancements
encourage researchers to develop the STT-MRAM-based cache. However, this
goal is hindered by the fact that the write performance of the STT-MRAM is poorer
than that of the conventional SRAM. To overcome this weakness, researchers from
physics, electronics and computer science have proposed numerous optimization
strategies at the device level, cell level, circuit level and architecture level.
These works were summarized in the main body of this chapter.
Besides, another route to a high-performance MRAM cache is to revolutionize the
mechanism of the magnetization switching. The recently proposed SOT and VCMA
have shown promising potential for high-speed low-power MRAM. Nevertheless,
many intrinsic difficulties at the device level need to be solved before they can be
used to design the NV-Cache. In addition, other spintronics concepts such as
domain-wall racetrack memory and skyrmions have also been explored in the design of the
NV-Cache [85–90], although they are not covered in this chapter. We believe that

the above technologies will coexist for a long time during the exploration of the
MRAM-based cache.

Acknowledgements This work was supported by the National Natural Science Foundation of
China (61704005, 61501013 and 61571023), the National Key Technology Program of China
(2017ZX01032101), and the International Mobility Project (B16001 and 2015DFE12880).

References

1. G. Prenat, K. Jabeur, P. Vanhauwaert, G. Pendina, F. Oboril, R. Bishnoi, M. Ebrahimi, N.
Lamard, O. Boulle, K. Garello, J. Langer, B. Ocker, M. Cyrille, P. Gambardella, M. Tahoori,
G. Gaudin, Ultra-fast and high-reliability SOT-MRAM: from cache replacement to normally-
off computing. IEEE Trans. Multi-Scale Comput. Syst. 2(1), 49–60 (2016)
2. Inside NAND Flash Memories (Springer, Dordrecht, The Netherlands, 2010)
3. H. Wong, S. Raoux, S. Kim, J. Liang, J. Reifenberg, B. Rajendran, M. Asheghi, K. Goodson,
Phase change memory. Proc. IEEE 98(12), 2201–2227 (2010)
4. M. Qazi, M. Clinton, S. Bartling, A. Chandrakasan, A low-voltage 1 Mb FRAM in 0.13 µm
CMOS featuring time-to-digital sensing for expanded operating margin. IEEE J. Solid-State
Circuits 47(1), 141–150 (2012)
5. D. Apalkov, B. Dieny, J. Slaughter, Magnetoresistive random access memory. Proc. IEEE
104(10), 1796–1830 (2016)
6. S. Bhatti, R. Sbiaa, A. Hirohata, H. Ohno, S. Fukami, S. Piramanayagam, Spintronics based
random access memory: a review. Mater. Today 20(9), 530–548 (2017)
7. H. Akinaga, H. Shima, Resistive random access memory (ReRAM) based on metal oxides.
Proc. IEEE 98(12), 2237–2251 (2010)
8. H. Noguchi, et al., 4 MB STT-MRAM-based cache with memory-access-aware power optimiza-
tion and write-verify-write/read-modify-write scheme, in IEEE-ISSCC (2016), pp. 132–133
9. H. Noguchi, et al., 7.5 A 3.3 ns-access-time 71.2 µW/MHz 1 Mb embedded STT-MRAM
using physically eliminated read-disturb scheme and normally-off memory architecture, in
IEEE-ISSCC (2015), pp. 1–3
10. A.D. Kent, D. Worledge, A new spin on magnetic memories. Nat. Nanotechnol. 10(3), 187–191
(2015)
11. B. Engel, J. Akerman, B. Butcher, R. Dave, M. DeHerrera, M. Durlam, G. Grynkewich, J.
Janesky, S. Pietambaram, N. Rizzo, J. Slaughter, K. Smith, J. Sun, S. Tehrani, A 4-Mb toggle
MRAM based on a novel bit and switching method. IEEE Trans. Magn. 41(1), 132–136 (2005)
12. L. Berger, Emission of spin waves by a magnetic multilayer traversed by a current. Phys. Rev.
B 54(13), 9353–9358 (1996)
13. J. Slonczewski, Current-driven excitation of magnetic multilayers. J. Magn. Magn. Mater.
159(1–2), L1–L7 (1996)
14. Y. Huai, F. Albert, P. Nguyen, M. Pakala, T. Valet, Observation of spin-transfer switching in
deep submicron-sized and low-resistance magnetic tunnel junctions. Appl. Phys. Lett. 84(16),
3118–3120 (2004)
15. Everspin Technologies
16. W. Zhao, C. Chappert, V. Javerliac, J. Noziere, High speed, high stability and low power sensing
amplifier for MTJ/CMOS hybrid logic circuits. IEEE Trans. Magn. 45(10), 3784–3787 (2009)
17. Y. Chen, H. Li, X. Wang, W. Zhu, W. Xu, T. Zhang, A 130 nm 1.2 V/3.3 V 16 Kb spin-transfer
torque random access memory with nondestructive self-reference sensing scheme. IEEE J.
Solid-State Circuits 47(2), 560–573 (2012)
18. W. Kang, L. Zhang, J.O. Klein, Y. Zhang, D.R. Ravolosona, W. Zhao, Reconfigurable codesign
of STT-MRAM under process variations in deeply scaled technology. IEEE Trans. Electron
Devices 62(6), 1769–1777 (2015)

19. S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma, H.D. Gan, M. Endo, S. Kanai, J. Hayakawa,
F. Matsukura, H. Ohno, A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction.
Nat. Mater. 9(9), 721–724 (2010)
20. M. Hosomi, et al., A novel nonvolatile memory with spin torque transfer magnetization switch-
ing: spin-RAM, in IEEE-IEDM (2005), pp. 459–462
21. A. Maashri, G. Sun, X. Dong, V. Narayanan, Y. Xie, 3D GPU architecture using cache stacking:
Performance, cost, power and thermal analysis, in IEEE-ICCD (2009), pp. 254–259
22. G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, A novel architecture of the 3D stacked MRAM L2
cache for CMPs, in IEEE-HPCA (2009), pp. 239–249
23. X. Dong, et al., Circuit and microarchitecture evaluation of 3D stacking magnetic RAM
(MRAM) as a universal memory replacement, in ACM/IEEE DAC (2008), pp. 554–559
24. G. Jan, et al., Demonstration of fully functional 8 Mb perpendicular STT-MRAM chips with
sub-5 ns writing for non-volatile embedded memories, in IEEE Symposium on VLSI Technology
(2014), pp. 42–43
25. D. Saida, et al., Sub-3 ns pulse with sub-100 µA switching of 1x–2x nm perpendicular MTJ
for high-performance embedded STT-MRAM towards sub-20 nm CMOS, in IEEE Symposium
on VLSI Technology (2016), pp. 1–2
26. G. Jan, et al., Achieving sub-ns switching of STT-MRAM for future embedded LLC appli-
cations through improvement of nucleation and propagation switching mechanisms, in IEEE
Symposium on VLSI Technology (2016), pp. 1–2
27. D. Saida, S. Kashiwada, M. Yakabe, T. Daibou, M. Fukumoto, S. Miwa, Y. Suzuki, K. Abe,
H. Noguchi, J. Ito, S. Fujita, 1x–2x nm perpendicular MTJ switching at sub-3-ns pulses below
100 µA for high-performance embedded STT-MRAM for sub-20-nm CMOS. IEEE Trans.
Electron Devices 64(2), 427–431 (2017)
28. P. Zhou, B. Zhao, J. Yang, Y. Zhang, Energy reduction for STT-RAM using early write termi-
nation, in IEEE/ACM ICCAD (2009), pp. 264–268
29. J. Wang, X. Dong, Y. Xie, OAP: an obstruction-aware cache management policy for STT-RAM
last-level caches, in DATE (2013), pp. 847–852
30. C.J. Lin, et al., 45 nm low power CMOS logic compatible embedded STT MRAM utilizing a
reverse-connection 1T/1MTJ cell, in IEEE-IEDM (2009), pp. 1–4
31. K. Ikegami, et al., Low power and high density STT-MRAM for embedded cache memory
using advanced perpendicular MTJ integrations and asymmetric compensation techniques, in
IEEE-IEDM (2014), pp. 28.1.1–28.1.4
32. G. Sun, Y. Zhang, Y. Wang, Y. Chen, Improving energy efficiency of write-asymmetric mem-
ories by log style write, in ISLPED (2012), pp. 173–178
33. X. Wu, J. Li, L. Zhang, E. Speight, Y. Xie, Power and performance of read-write aware hybrid
caches with non-volatile memories, in DATE (2009), pp. 737–742
34. J. Li, C. Xue, Y. Xu, STT-RAM based energy-efficiency hybrid cache for CMPs, in IEEE/IFIP
VLSI-SoC (2011), pp. 31–36
35. K. Qiu, M. Zhao, Q. Li, C. Fu, C. Xue, Migration-aware loop retiming for STT-RAM-based
hybrid cache in embedded systems. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
33(3), 329–342 (2014)
36. A. Sharifi, M. Kandemir, Automatic feedback control of shared hybrid caches in 3D chip
multiprocessors, in International Euromicro Conference on PDP (2011), pp. 393–400
37. B. Wu, Y. Cheng, J. Yang, A. Todri-Sanial, W. Zhao, Temperature impact analysis and access
reliability enhancement for 1T1MTJ STT-RAM. IEEE Trans. Reliab. 65(4), 1755–1768 (2016)
38. B. Wu, et al., Thermosiphon: a thermal aware NUCA architecture for write energy reduction
of the STT-MRAM based LLCs, in IEEE/ACM ICCAD (2017), pp. 474–481
39. C. Kim, D. Burger, S.W. Keckler, An Adaptive, non-uniform cache structure for wire-delay
dominated on-chip caches, in ACM-ASPLOS (2002), pp. 211–222
40. W. Zhao et al., Failure and reliability analysis of STT-MRAM. Microelectron. Reliab. 52(9–10),
1848–1852 (2011)
41. D. Zhang, L. Zeng, T. Gao, F. Gong, X. Qin, W. Kang, Y. Zhang, Y. Zhang, J. Klein, W.
Zhao, Reliability-enhanced separated pre-charge sensing amplifier for hybrid CMOS/MTJ
logic circuits. IEEE Trans. Magn. 53(9), 1–5 (2017)

42. H. Zhang, W. Kang, T. Pang, W. Lv, Y. Zhang, W. Zhao, Dual reference sensing scheme with
triple steady states for deeply scaled STT-MRAM, in IEEE/ACM NANOARCH (2016), pp. 1–6
43. L. Zhang, et al., Channel modeling and reliability enhancement design techniques for STT-
MRAM, in ISVLSI (2015), pp. 461–466
44. M. McCartney, SRAM reliability improvement using ECC and circuit techniques, Ph.D. thesis
(2014)
45. X. Wang, M. Mao, E. Eken, W. Wen, H. Li, Y. Chen, Sliding basket: an adaptive ECC scheme
for runtime write failure suppression of STT-RAM cache, in DATE (2016), pp. 762–767
46. X. Dong, C. Xu, Y. Xie, N. Jouppi, NVSim: a circuit-level performance, energy, and area
model for emerging nonvolatile memory. IEEE Trans. Comput. Aided Des. Integr. Circuits
Syst. 31(7), 994–1007 (2012)
47. S. Wilton, N. Jouppi, CACTI: an enhanced cache access and cycle time model. IEEE J. Solid-
State Circuits 31(5), 677–688 (1996)
48. B. Wu, et al., An architecture-level cache simulation framework supporting advanced PMA
STT-MRAM, in IEEE/ACM NANOARCH (2015), pp. 7–12
49. E. Eken, et al., NVSim-VXs: an improved NVSim for variation aware STT-RAM simulation,
in ACM/EDAC/IEEE-DAC (2016), pp. 1–6
50. K. Abe, et al., Novel hybrid DRAM/MRAM design for reducing power of high performance
mobile CPU, in IEEE-IEDM (2012), pp. 10.5.1–10.5.4
51. S. Yamamoto, S. Sugahara, Nonvolatile static random access memory using magnetic tunnel
junctions with current-induced magnetization switching architecture. Jpn. J. Appl. Phys. 48(4),
043001 (2009)
52. T. Ohsawa, et al., 1 Mb 4T-2MTJ nonvolatile STT-RAM for embedded memories using 32b
fine-grained power gating technique with 1.0 ns/200 ps wake-up/power-off times, in Symposium
on VLSIC (2012), pp. 46–47
53. H. Noguchi, et al., A 250-MHz 256b-I/O 1-Mb STT-MRAM with advanced perpendicular
MTJ based dual cell for nonvolatile magnetic caches to reduce active power of processors, in
Symposium on VLSI Technology (2013), pp. 108–109
54. H. Noguchi, et al., Highly reliable and low-power nonvolatile cache memory with advanced per-
pendicular STT-MRAM for high-performance CPU, in Symposium on VLSIC (2014), pp. 1–2
55. A. Kawasumi, et al., Circuit techniques in realizing voltage-generator-less STT MRAM suitable
for normally-off-type non-volatile L2 cache memory, in IEEE-IMW (2013), pp. 76–79
56. L. Xue, B. Wu, B. Zhang, Y. Cheng, P. Wang, C. Park, J. Kan, S. Kang, Y. Xie, An adaptive
3T-3MTJ memory cell design for STT-MRAM-based LLCs. IEEE Trans. Very Large Scale
Integr. (VLSI) Syst. 26(3), 484–495 (2018)
57. M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang, J.
Langer, B. Ocker, A. Fert, W. Zhao, Current-induced magnetization switching in atom-thick
tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresis-
tance. Nat. Commun. 9(1) (2018)
58. K. Ikegami, et al., MTJ-based ‘normally-off processors’ with thermal stability factor engineered
perpendicular MTJ L2 cache based on 2T-2MTJ cell L3 and last level cache based on 1T-1MTJ
cell and novel error handling scheme, in IEEE-IEDM (2015), pp. 25.1.1–25.1.4
59. C.W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, M.R. Stan, Relaxing non-volatility for
fast and energy-efficient STT-RAM caches, in IEEE-HPCA (2011), pp. 50–61
60. H. Li, X. Wang, Z. Ong, W. Wong, Y. Zhang, P. Wang, Y. Chen, Performance, power, and
reliability tradeoffs of STT-RAM cell subject to architecture-level requirement. IEEE Trans.
Magn. 47(10), 2356–2359 (2011)
61. A. Jog, et al., Cache revive: architecting volatile STT-RAM caches for enhanced performance
in CMPs, in DAC (2012), pp. 243–252
62. Z. Sun, et al., Multi retention level STT-RAM cache designs with a dynamic refresh scheme,
in IEEE/ACM MICRO (2011), pp. 329–338
63. I. Miron, K. Garello, G. Gaudin, P. Zermatten, M. Costache, S. Auffret, S. Bandiera, B. Rod-
macq, A. Schuhl, P. Gambardella, Perpendicular switching of a single ferromagnetic layer
induced by in-plane current injection. Nature 476(7359), 189–193 (2011)

64. L. Liu, C. Pai, Y. Li, H. Tseng, D. Ralph, R. Buhrman, Spin-torque switching with the giant
spin Hall effect of tantalum. Science 336(6081), 555–558 (2012)
65. M. Cubukcu, O. Boulle, M. Drouard, K. Garello, C. Onur Avci, I. Mihai Miron, J. Langer,
B. Ocker, P. Gambardella, G. Gaudin, Spin-orbit torque magnetization switching of a three-
terminal perpendicular magnetic tunnel junction. Appl. Phys. Lett. 104(4), 042406 (2014)
66. Z. Wang, Z. Li, Y. Liu, S. Li, L. Chang, W. Kang, Y. Zhang, W. Zhao, Progresses and challenges
of spin orbit torque driven magnetization switching and application, in IEEE-ISCAS (2018)
67. M. Cubukcu, O. Boulle, N. Mikuszeit, C. Hamelin, T. Bracher, N. Lamard, M. Cyrille, L.
Buda-Prejbeanu, K. Garello, I. Miron, O. Klein, G. de Loubens, V. Naletov, J. Langer, B.
Ocker, P. Gambardella, G. Gaudin, Ultra-fast perpendicular spin-orbit torque MRAM. IEEE
Trans. Magn. 54(4), 1–4 (2018)
68. J. Kim, et al., Spin-Hall effect MRAM based cache memory: a feasibility study, in DRC (2015),
pp. 117–118
69. R. Bishnoi, M. Ebrahimi, F. Oboril, M.B. Tahoori, Architectural aspects in design and analysis
of SOT-based memories, in ASP-DAC (2014), pp. 700–707
70. Z. Wang, L. Zhang, M. Wang, Z. Wang, D. Zhu, Y. Zhang, W. Zhao, High-density NAND-like
spin transfer torque memory with spin orbit torque erase operation. IEEE Electron Device Lett.
39(3), 343–346 (2018)
71. H. Yoda, et al., Voltage-control spintronics memory (VoCSM) having potentials of ultra-low
energy-consumption and high-density, in IEEE-IEDM (2016), pp. 27.6.1–27.6.4
72. Z. Wang, W. Zhao, E. Deng, J. Klein, C. Chappert, Perpendicular-anisotropy magnetic tunnel
junction switched by spin-Hall-assisted spin-transfer torque. J. Phys. D Appl. Phys. 48(6),
045001 (2015)
73. A. van den Brink, S. Cosemans, S. Cornelissen, M. Manfrini, A. Vaysset, W. Van Roy, T. Min,
H. Swagten, B. Koopmans, Spin-Hall-assisted magnetic random access memory. Appl. Phys.
Lett. 104(1), 012403 (2014)
74. L. Chang, et al., Evaluation of spin-Hall-assisted STT-MRAM for cache replacement, in
IEEE/ACM NANOARCH (2016), pp. 73–78
75. M. Wang et al., Field-free switching of a perpendicular magnetic tunnel junction through the
interplay of spin–orbit and spin-transfer torques. Nat. Electron. 1, 582–588 (2018)
76. Z. Wang et al., Proposal of Toggle Spin Torques Magnetic RAM for Ultrafast Computing.
IEEE Electron Device Lett. 40(5), 726–729 (2019)
77. S. Fukami, C. Zhang, S. DuttaGupta, A. Kurenkov, H. Ohno, Magnetization switching
by spin–orbit torque in an antiferromagnet–ferromagnet bilayer system. Nat. Mater. 15(5),
535–541 (2016)
78. Y. Oh, S. Chris Baek, Y. Kim, H. Lee, K. Lee, C. Yang, E. Park, K. Lee, K. Kim, G. Go, J.
Jeong, B. Min, H. Lee, K. Lee, B. Park, Field-free switching of perpendicular magnetization
through spin–orbit torque in antiferromagnet/ferromagnet/oxide structures. Nat. Nanotechnol.
11(10), 878–884 (2016)
79. W. Wang, M. Li, S. Hageman, C. Chien, Electric-field-assisted switching in magnetic tunnel
junctions. Nat. Mater. 11(1), 64–68 (2011)
80. J.G. Alzate, et al., Voltage-induced switching of nanoscale magnetic tunnel junctions, in IEEE-
IEDM (2012), pp. 29.5.1–29.5.4
81. K. Wang, H. Lee, P. Khalili Amiri, Magnetoelectric random access memory-based circuit design
by using voltage-controlled magnetic anisotropy in magnetic tunnel junctions. IEEE Trans.
Nanotechnol. 14(6), 992–997 (2015)
82. W. Kang, Y. Ran, Y. Zhang, W. Lv, W. Zhao, Modeling and exploration of the voltage-controlled
magnetic anisotropy effect for the next-generation low-power and high-speed MRAM appli-
cations. IEEE Trans. Nanotechnol. 16(3), 387–395 (2017)
83. W. Kang, L. Chang, Y. Zhang, W. Zhao, Voltage-controlled MRAM for working memory:
perspectives and challenges, in DATE (2017), pp. 542–547
84. S. Kanai, Y. Nakatani, M. Yamanouchi, S. Ikeda, H. Sato, F. Matsukura, H. Ohno, Magnetization
switching in a CoFeB/MgO magnetic tunnel junction by combining spin-transfer torque and
electric field-effect. Appl. Phys. Lett. 104(21), 212406 (2014)

85. H. Xu, Y. Li, R. Melhem, A.K. Jones, Multilane racetrack caches: improving efficiency through
compression and independent shifting, in ASP-DAC (2015), pp. 417–422
86. X. Zhang, L. Zhao, Y. Zhang, J. Yang, Exploit common source-line to construct energy efficient
domain wall memory based caches, in IEEE-ICCD (2015), pp. 157–163
87. R. Venkatesan et al., Cache design with domain wall memory. IEEE Trans. Comput. 65(4),
1010–1024 (2016)
88. W. Kang, C. Zheng, Y. Huang, X. Zhang, W. Lv, Y. Zhou, W. Zhao, Compact modeling and
evaluation of magnetic skyrmion-based racetrack memory. IEEE Trans. Electron Devices 64(3),
1060–1068 (2017)
89. W. Kang, Y. Huang, X. Zhang, Y. Zhou, W. Zhao, Skyrmion-electronics: an overview and
outlook. Proc. IEEE 104(10), 2040–2061 (2016)
90. F. Chen et al., Process variation aware data management for magnetic skyrmions racetrack
memory, in ASP-DAC (2018), pp. 221–226
Chapter 2
CMOS-OxRAM Based Hybrid
Nonvolatile SRAM and Flip-Flop:
Circuit Implementations

Swatilekha Majumdar, Sandeep Kaur Kingra and Manan Suri

Abstract A critical technological challenge over the past few decades has been
to achieve low-power operation without sacrificing performance. This has led to the
development of computing units that can normally be turned off when not in use
and turned on instantly with full performance when required, thereby helping to
eliminate leakage power. However, with a direct power-down, the states in local
memories (SRAM) and volatile registers (SRAM-based flip-flops) are lost. Thus,
to enable a power-down mode in SRAM-based memories and flip-flops (FFs), the
data states are off-loaded to an external nonvolatile storage array, giving rise
to NV-SRAM/NV-FF circuits (i.e. nonvolatile SRAM/nonvolatile flip-flop). In this
chapter, we present a real-time 4T-2R NV-SRAM bitcell using HfOx-based OxRAM
(oxide-based random access memory) devices. We discuss the working principle,
programming methodologies and stability of the NV-SRAM bitcell. We further
present a novel NV-FF design based on the 4T-2R NV-SRAM bitcell and provide
an insight into its working and operating modes.

2.1 Introduction

With the advent of technologies like wireless sensors, bio-medical implants and
the internet-of-things (IoT), ultra-low-power operation and a "normally-off, instant
power-on" mode have become an absolute necessity [1–3]. These systems have sporadic
wake-up times, and thus leakage power is a dominant component of their
power consumption. To minimize the leakage power, a power-gating
approach has been proposed, where a lower voltage (hold voltage) is used for
S. Majumdar · S. K. Kingra · M. Suri (B)


Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, Delhi, India
e-mail: [email protected]
S. Majumdar
e-mail: [email protected]
S. K. Kingra
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020 29


M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_2
30 S. Majumdar et al.

volatile memory to retain data while all logic circuits are turned off [4]. However, even
maintaining this hold voltage (during the power-down mode) in high-performance
processing units leads to a large power dissipation due to leakage current, which
is ≈40% of the dynamic energy [5]. Even worse, during abrupt power failures the
data in volatile memory is lost and computation tasks have to be restarted. This happens
because of the volatile nature of the CMOS memory cells used in conventional CPUs,
such as SRAM-based caches and flip-flop (FF) based register files. To mitigate these
issues, different circuits have been designed to back up data from on-chip memory
(SRAM), FFs and registers to off-chip nonvolatile memory (NVM), thus preserving
the system state in case of power failures. This is known as the two-macro scheme, i.e.
SRAM (for faster access) in conjunction with NVM (for nonvolatility). However, the
main drawback of this methodology is that it requires a long store/restore time due to
the serial SRAM read/write and long NVM write/read procedures. This results in a long
power-on/off time. Thus, the two-macro scheme is vulnerable to data loss in case of
a sudden power failure [6, 7]. To address these limitations, NVM elements are directly
integrated into the SRAM or FF units, forming a direct bit-to-bit connection in a
vertical arrangement to achieve faster parallel data transfer and turn-on/off speed.
This gives rise to NV-SRAM/NV-FF units.
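The difference between the serial two-macro backup and the parallel bit-to-bit backup can be sketched with a simple timing estimate. The model below is an assumption made only for illustration: the two-macro scheme reads each SRAM word and writes it to the NVM one after another, while in the NV-SRAM every bitcell stores into its own NVM element at the same time. The word count and the timing values are hypothetical.

```python
def two_macro_store_time_ns(n_words, t_sram_read_ns, t_nvm_write_ns):
    """Serial backup: every word is read from the SRAM and then written to the NVM."""
    return n_words * (t_sram_read_ns + t_nvm_write_ns)

def bit_to_bit_store_time_ns(t_nvm_write_ns):
    """Parallel backup: all bitcells program their local NVM elements at once."""
    return t_nvm_write_ns

# Hypothetical numbers, chosen only to show the scaling with array size.
N_WORDS = 1024
T_SRAM_READ_NS = 1.0
T_NVM_WRITE_NS = 50.0

serial_ns = two_macro_store_time_ns(N_WORDS, T_SRAM_READ_NS, T_NVM_WRITE_NS)
parallel_ns = bit_to_bit_store_time_ns(T_NVM_WRITE_NS)
print(f"two-macro store time : {serial_ns:.0f} ns")
print(f"bit-to-bit store time: {parallel_ns:.0f} ns")
print(f"speed-up             : {serial_ns / parallel_ns:.0f}x")
```

In this simplified picture the serial store time grows linearly with the number of words, whereas the parallel store time is independent of the array size, which is the motivation for integrating the NVM elements into each cell.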
NVMs such as floating-gate based memories, PCM (Phase Change
Memory), FRAM (Ferroelectric RAM), OxRAM (Oxide-based RAM), CBRAM
(Conductive Bridge RAM) and STT-MRAM (Spin Transfer Torque based
Magnetoresistive RAM) have emerged as promising solutions for realizing embedded
nonvolatile logic. However, due to their large access/programming times, high operating
voltages and limited endurance, floating-gate or FLASH memories are less favored
choices. PCM devices, on the other hand, require a large current to heat the GST
material for resistive switching between the crystalline and amorphous states. FRAM poses a
number of challenges owing to data signal degradation as devices are scaled.
STT-MRAM also needs large programming currents to exert a spin torque on the magnetic
moment of the free layer with respect to the fixed layer, and hence leads to higher
power dissipation during the programming phase. As a result, OxRAM devices have
emerged as a great choice for hybrid CMOS-NVM based nonvolatile circuits owing to
their low cost, high density, low operating voltages, negligible leakage, access times
about 1000× faster than floating-gate memories, full CMOS compatibility, and the
possibility of 3D integration and integration in vias [8–11]. In this chapter, we present
the most important nonvolatile (CMOS-OxRAM) counterparts of conventional volatile
memory systems, i.e. (i) NV-SRAM and (ii) NV-FF. These hybrid nonvolatile circuits
offer advantages such as: (i) nearly zero leakage, (ii) efficient backup/restore operation
and (iii) high performance and low energy. We present a 4T-2R NV-SRAM
bitcell that offers "real-time nonvolatility". Using this 4T-2R NV-SRAM bitcell, we
propose a novel NV-FF design. This chapter is organized as follows: Sect. 2.2
summarizes the different NV-SRAM/NV-FF implementations proposed in the literature
so far. Section 2.3 discusses our 4T-2R NV-SRAM bitcell and explains its different
programming schemes. Section 2.4 shows our novel real-time NV-FF implementation
using the 4T-2R NV-SRAM bitcell and presents its operating modes. We have also
2 CMOS-OxRAM Based Hybrid Nonvolatile SRAM and Flip-Flop … 31

presented a modified NV-FF design that offers better system performance compared
to the aforementioned NV-FF design. The chapter concludes with Sect. 2.5.

2.2 Prior Art: NV-SRAM and NV-FF

This section gives an overview of developments in NV-SRAM/NV-FF circuit
designs using emerging NVM technologies.

2.2.1 Nonvolatile Static Random Access Memory

Memory architectures use a hierarchy of caches (L1, L2, last level cache (LLC), etc.),
and the optimization target for designing each cache level is different. L1 is accessed
quite frequently and therefore needs higher speed and write endurance, whereas the
LLC is targeted at minimizing off-chip accesses and thereby needs large capacity.
Hence, it is recommended to have an SRAM-based L1 cache for better performance
[12], whereas emerging NVMs can be used in L2 or LLC (due to their latency,
density and write endurance values). To realize a nonvolatile cache using NVM,
researchers have proposed various bitcell-based optimization schemes. One proposal
is to use NV-SRAM (comprising a volatile and a nonvolatile circuit) for nonvolatile cache
implementation. Under normal operation (when external power is supplied), the
volatile circuit provides fast data access. When controlled power-down/sleep-mode is
enabled or there is a sudden power failure, the nonvolatile circuit provides data backup,
thereby retaining data previously stored in the volatile circuit. In literature, several
different hybrid (CMOS-OxRAM/CMOS-MTJ (magnetic tunnel junction)/CMOS-PCM)
NV-SRAM designs such as 9T-2R [13], 8T-2R [7, 14, 15], 8T-2MTJ [16], 8T-1R
[17], 7T-2R [18, 19], 7T-1R [20, 21], 6T-2R [22], 6T-2MTJ [23], 4T-2R [24, 25] and
4T-2MTJ [26, 27] have been proposed. Figure 2.1 shows the circuit schematics of
different NV-SRAM implementations. These implementations differ in their approach
to storing data during power-down mode. Xue et al. [13] proposed a 9T-2R NV-SRAM
bitcell in which an equalization transistor connected between the storage nodes
is used for the data restoring mode. However, the area requirement for 9T-2R is ≈230F²
compared to ≈140F² for conventional 6T SRAM. Furthermore, separate wordlines (WLs)
are required for the storage nodes, which increases the count of control signals and leads to
routing congestion. Chiu et al. [7, 14] proposed an 8T-2R bitcell for better density
compared to 9T-2R. This bitcell offered a BL-CL (Bitline-Control line) sharing scheme to
reduce area overhead and also enabled a write-assist function. However, the drawback
of this implementation is the requirement of extra control lines for off-loading
the data for power-down mode. To minimize the leakage currents,
Tasson et al. [17] proposed an 8T-1R NV-SRAM bitcell. The restore time for 8T-1R is
≈2.6× that of 8T-2R [7] due to the multiple steps involved in its operation, and
its read latency is higher than that of 9T-2R [13], 6T-2R [22] and conventional 6T SRAM.

Fig. 2.1 Circuit schematics of different NV-SRAM bitcells proposed in literature: a 9T-2R: WLL
and WLR are separate WLs to control 1T-1R cells, b 8T-2R: SWL indicates NVM switch line,
c 8T-1R, d 7T-2R, e 7T-1R, and f 6T-2R (redrawn from [13, 14, 17, 18, 20, 22]). Variable resistance
here indicates NVM element

By using 1T-2R as the off-loading storage element with a conventional 6T SRAM cell, a
7T-2R bitcell was proposed by Sheu et al. [18]. Using this bitcell, the write margin
improved by 1.03× and 1.37× compared to the 6T SRAM and 6T-2R [22] bitcells
respectively. However, compared to 8T-2R [7], write margin and read stability
are degraded. Furthermore, the area of the 7T-2R bitcell is 1.07× that of the 6T-2R bitcell.
The NVM elements used in implementations [7, 13–23] store the NV-SRAM state
only during controlled power-down or sleep-mode, enabling only a
‘last-bit nonvolatility’, whereas the designs in [24–27] offer ‘real-time
nonvolatility’ as the NVM devices participate actively during bitcell programming. In this
chapter, we discuss our 4T-2R NV-SRAM work [24, 25], which offers ‘real-time
nonvolatility’, and summarize its different programming schemes.

2.2.2 Nonvolatile Flip-Flops

Several NV-FFs have been proposed over time using emerging NVM devices such
as OxRAM [5, 27–31], MTJ [32–42], ferroelectric capacitors [43–45] and
transistors [6, 46]. These flip-flops provide on-demand, controlled data backup and
restore whenever the appropriate backup signal is triggered. However, having an additional
circuit as an off-loading data block leads to area and power overheads. Therefore,
the major challenge in designing an NV-FF lies in arriving at an area-efficient
circuit that also achieves high performance in terms of speed, power and energy. A lot of

Fig. 2.2 Different NV-FF schematics proposed in literature a OxRAM-based NV-FF [5] b STT-
MTJ-based NV-FF [34] c SHE-MTJ-based NV-FF [35] d Ferroelectric capacitors-based NV-FF
[46]. The figures have been redrawn from the referenced papers

developmental work has been done in the design and optimization of NV-FFs. Figure 2.2
shows some of the NV-FF designs proposed in literature. Iyenger et al. [3]
proposed an MTJ-based NV-FF with enhanced scan capability in two variants: Enhanced
Scan Enabled NV-FF (ES NV-FF) and High Performance ES NV-FF (HPES NV-FF).
In ES NV-FF, two parallel latches allowed enhanced scan and store-restore
operations. The output of the master latch was connected to the slave latch as well as
the NV latch. The two MTJ devices were written serially during the negative pulse of
the clock cycle, which limited the operating frequency of the FF. In HPES NV-FF, the
MTJ devices were written in parallel, so the frequency of the FF was not compromised.
The authors also reported that the cell area of ES NV-FF was ≈1.8× that of a
standard master–slave FF (MSFF), with a maximum frequency of 2 GHz. HPES
NV-FF had an area overhead of ≈2.5× that of MSFF with a 2 GHz operating range.
In [5], a bipolar OxRAM-based NV-FF was proposed. The off-loading NVM
circuit, connected to the slave part of the FF element, comprised two OxRAM
devices whose operational modes were controlled by groups of transistors called
NVM-L and NVM-R. Each NVM block was a 3T-1R structure that contributed to
controlling and providing current compliance to the circuit. The authors claimed that the
circuit has zero standby-leakage power and nonvolatility, at an area overhead of only
25% compared to the Balloon FF solution [47] and a 10% increase in CLK-Q delay
compared to a normal FF. In [33] and [48], two MTJ devices were used for
off-loading the data from the MSFF, and the MTJ devices retained the off-loaded state only
during sleep-mode. In these designs, the MTJ states were updated on every clock
cycle, which increased the power consumption and reduced the FF speed and the endurance
of the MTJ. Furthermore, Jung et al. [48] aimed to minimize short-circuit current by
using a low-skewed NAND (LS-NAND), which was used to efficiently interface the
two supply voltage levels of 1.1 and 1.8 V. In [32, 49, 50], the NV-FF was
implemented as a part of the write driver circuit. As a result, the transistor sizes in these designs
were quite large, leading to higher parasitic capacitance. This affected the operational
speed of the FF as well as its data integrity. The magnetic FF proposed by Sakimura et al.
[32] gave a maximum operating frequency of 500 MHz with 1 ns data backup time.
Endoh et al. [50] proposed a PFET-based 1T-1MTJ NV-FF with an operating frequency
of 600 MHz. Kazi et al. [51] proposed two OxRAM-based NV-FFs exploiting
sub-V_T operation, enabling zero-leakage sleep states. The FF operated at 2 V and had a
current compliance of 10 µA. The write energy was OxRAM dependent while the
sub-V_T operation reduced the read energy by 5.4%. The restore operation was done
at 0.4 V. In recent work by Kang et al. [52], a Voltage Controlled Magnetic Anisotropy
(VCMA) NV-FF was proposed, which exploited magnetic anisotropy assistance
for faster switching of the magnetic devices used in the circuit. The authors reported
that due to the VCMA phenomenon, the current density and pulse duration can
be greatly reduced for MTJ switching. An improvement of 98.4% was observed in
data backup energy for the VCMA STT-MRAM-based NV-FF, and an 89.5% improvement
was observed in data backup delay, compared to a conventional STT-MRAM-based
NV-FF. While this methodology was beneficial for the STT-MRAM-based NV-FF, the
margin of improvement in the SHE-based NV-FF was smaller (74.6% in data backup
energy and 19% in data backup delay). Bishnoi et al. [53] proposed a 2-MTJ-based
NV-FF which reduced the static power consumption by 5× compared to CMOS-based
FFs. However, the proposed design was bulky as it required 32 transistors and
2 MTJ cells, compared to 26 transistors used in conventional CMOS-based FFs. A
ferroelectric-based nonvolatile FF for wearable health care systems was proposed
by Izumi et al. in [54]. The FF was based on storing complementary data in coupled
ferroelectric capacitors, which enabled a reduction in the capacitor size by 88%. The
FF had a read voltage margin of 240 mV at 1.5 V, which resulted in a low access
energy of 2.4 pJ with 10-year (at 85 °C) data-retention capability. Ali et al. [55] also proposed
an MTJ-based NV-FF aimed at power gating applications. The proposed
design could achieve 80% less area compared to a traditional STT-MRAM-based
NV-FF, with a backup energy of 111 fJ and a restore energy of 6.9 fJ. The backup and
restore times achieved were 3 ns and 0.16 ns respectively.
All the above designs are based on off-loading data when a controlled power-down
signal is applied. These designs do not account for the fact that a power outage
might also be caused by glitches, which leads to loss of data since the data is not
backed up during the normal phase. Some designs use a battery backup for such cases,
where a sudden power loss brings the FF to a battery mode that holds enough charge to
back up the states to the NVM block. This battery backup block requires extra area
and therefore increases the overhead. Moreover, the designs that have no
battery backup for sudden power loss optimistically assume
that power glitches will not corrupt the data. The
circuit concepts used in developing NV-SRAM can be extended to designing NV-FFs
[44]. We therefore take the above points into consideration and propose
a real-time data-backup-based NV-FF built on the 4T-2R NV-SRAM
proposed in [24].

2.3 NV-SRAM: Principle, Programming Schemes and Stability Analysis

The 4T-2R NV-SRAM bitcell discussed in this study is shown in Fig. 2.3a [24]. Figure 2.3b
shows the I–V characteristics of 3 nm thick HfOx based OxRAM devices obtained
using the compact model described in [56]. To realize nonvolatility in the 4T-2R
NV-SRAM bitcell, the pull-up transistors of the SRAM bitcell are replaced by OxRAM
devices. The OxRAM devices actively participate in NV-SRAM programming and
help retain the logic state during power-down mode. The NV-SRAM bitcell has two
modes of operation: Write mode and Read mode. The OxRAM devices are programmed
only during the Write mode. True nonvolatility of the NV-SRAM bitcell is achieved
as data can be retrieved from the OxRAM devices not only after a controlled
power-down but also after an abrupt power failure.

2.3.1 Programming Schemes

For NV-SRAM, to encode the data in the OxRAM devices, we have proposed different
programming schemes [24, 25]. The programming schemes are classified on the basis

Fig. 2.3 a Circuit schematic of 4T-2R NV-SRAM bitcell (redrawn from [24]), b DC I–V
characteristics of 3 nm thick HfOx based OxRAM device used in this study (modelled in [56])

of their approach to programming the OxRAM devices, namely (i) sequential programming,
in which the two OxRAM devices are programmed in two cycles, and (ii) parallel
programming, in which both OxRAM devices are programmed in a single cycle. The
working principle, advantages and trade-offs of these programming
schemes are summarized below:

2.3.1.1 Two-Cycle Programming Scheme

In the two-cycle programming scheme, the OxRAM devices in the bitcell are programmed
serially. A two-cycle programming pulse with a peak amplitude of 1.6 V is applied on PL.
PL is a 2 µs long pulse, with a 1 µs pulse for RESET (PL = 1) and a 1 µs pulse
for SET (PL = 0) programming. During the first cycle (PL = 1), the OxRAM device
connected to the internal node storing logic state ‘0’ undergoes RESET switching
(as V_TB is negative), whereas during the second cycle (PL = 0), the OxRAM device
connected to the internal node storing logic state ‘1’ undergoes SET switching (as V_TB
is positive). Note: V_TB(Ox1) = V_BL − V_PL and V_TB(Ox2) = V_BLB − V_PL. Initially,
both OxRAM devices are in a strong SET state and the current through them is on the
order of a few mA (hence the higher power dissipation during the first programming cycle).
Figure 2.4 shows the switching activity in both OxRAM devices while writing
logic states ‘1’ and ‘0’ to the 4T-2R NV-SRAM bitcell. It is to be noted that the time
required to program the OxRAM device to the RESET state (≈470 ns) differs from the
time required to program it to the SET state (≈500 ns). During RESET programming
(first cycle), ≈390 nA flows through the OxRAM device and the post-programming
resistance is nearly 2 MΩ. During SET programming (second cycle), ≈2.8 µA flows
through the OxRAM device and the post-programming resistance is nearly 268 kΩ.
This programming methodology is called the two-cycle LRS-HRS scheme, and the
resistance window achieved using this scheme is ≈7.6×. In this scheme, the OxRAM
device in LRS determines the limiting parameters for the NV-SRAM performance

Fig. 2.4 During Write ‘1’ operation: switching in a Ox1 and b Ox2 devices for LRS-HRS and
HRS1-HRS2 programming schemes [24]

Fig. 2.5 Operational modes for Read and Write operations in a two-cycle programming scheme,
and b single-cycle programming scheme using pulse engineered signals at Programming line (PL)
and Bitline (BL) [25]

as the maximum current flows through it during both Read and Write operations. As
the resistance of the OxRAM device decreases, larger pull-down transistors are required
to handle the current flowing in the circuit. This undermines the inherent advantage of
using fewer transistors in the 4T-2R NV-SRAM design. The other disadvantages of the
LRS-HRS scheme are higher power dissipation, sneak paths and lower SNM (Static
Noise Margin). To mitigate some of these issues, a more efficient programming scheme,
HRS1-HRS2, can be used instead of LRS-HRS. In the HRS1-HRS2 scheme, one of the
OxRAM devices is programmed using a weak SET while the other OxRAM device
is programmed to the RESET state. This lowers the switching energy/bit and the
pull-down transistor area. In this scheme, the peak amplitude of PL is kept at 1.2 V (as 90 nm
CMOS uses similar voltage ranges for its operation). For Write logic ‘1’, data is
loaded to BL and its complement is loaded to BLB. While programming,
the positive V_TB across the OxRAM device storing ‘1’ and the negative
V_TB across the OxRAM device storing logic ‘0’ are smaller in magnitude than the
corresponding values when the OxRAM was programmed using PL = 1.6 V. This results in
different SET and RESET resistance states (0.68 MΩ and 2.04 MΩ, respectively) in each
OxRAM device using HRS1-HRS2 (see Fig. 2.4). Using HRS1-HRS2, V_TB for SET
switching is 313 mV (compared to 750 mV using LRS-HRS) and for RESET switching
is −780 mV (compared to −797 mV using LRS-HRS). The NMOS transistor width
and write energy are reduced to 240 nm (640 nm in LRS-HRS) and 0.414 pJ
(1.8 pJ in LRS-HRS) using the energy-efficient HRS1-HRS2 scheme. A detailed timing
diagram for the two-cycle programming scheme is shown in Fig. 2.5a.
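The polarity rule behind the two-cycle scheme can be sketched in a few lines. This is an idealised illustration, not the simulation setup of [24]: it assumes the internal nodes follow BL/BLB exactly, treats any negative V_TB as RESET and any positive V_TB as SET (no switching thresholds), and uses the resistance values quoted in the text and in Table 2.3. The function names are illustrative.

```python
V_PEAK = 1.6  # peak PL/BL amplitude in the LRS-HRS scheme (V), from the text


def v_tb(v_bitline, v_pl):
    """Voltage across an OxRAM device: V_TB = V_BL - V_PL (or V_BLB - V_PL)."""
    return v_bitline - v_pl


def action(v):
    """Idealised polarity rule: negative V_TB -> RESET, positive -> SET."""
    if v < 0:
        return "RESET"
    if v > 0:
        return "SET"
    return "hold"


def two_cycle_write(bit):
    """Return the per-cycle action on (Ox1, Ox2) when writing `bit`."""
    v_bl = V_PEAK if bit == 1 else 0.0
    v_blb = V_PEAK - v_bl
    schedule = []
    for v_pl in (V_PEAK, 0.0):  # cycle 1: PL = 1, cycle 2: PL = 0
        schedule.append((action(v_tb(v_bl, v_pl)), action(v_tb(v_blb, v_pl))))
    return schedule


# SET/RESET resistances (ohm) for the two schemes, as quoted in the chapter.
SCHEMES = {
    "LRS-HRS": {"r_set": 268e3, "r_reset": 2.04e6},
    "HRS1-HRS2": {"r_set": 0.68e6, "r_reset": 2.04e6},
}
WINDOW = {name: s["r_reset"] / s["r_set"] for name, s in SCHEMES.items()}

for bit in (1, 0):
    print(f"write {bit}: (Ox1, Ox2) per cycle = {two_cycle_write(bit)}")
for name, ratio in WINDOW.items():
    print(f"{name}: resistance window = {ratio:.1f}x")
```

In this simplified picture the device on the node storing ‘0’ is RESET in the first cycle and the device on the node storing ‘1’ is SET in the second, which is the order described above; the smaller window of HRS1-HRS2 is the price paid for its lower current and energy.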

2.3.1.2 Single-Cycle Programming Scheme

In this scheme, the PL and BL signals are modified in such a way that the OxRAM devices
are programmed simultaneously in a single cycle. A triangular pulse
with equal rise and fall times is applied at the PL line, providing the required amplitude
and polarity of V_TB to switch the OxRAM devices in the NV-SRAM simultaneously.

Figure 2.5b shows the timing diagram for the single-cycle programming scheme. For
data write ‘1’, the BL line is slowly ramped to 1.2 V while the BLB line is kept at 0 V.
When the access transistors are turned on, the internal nodes Q and QB reflect the
data written at BL and BLB. This action is further supported by the cross-coupled
connection between the NMOS pull-down transistors (M1 and M2). Figure 2.6a
shows the triangular pulse applied to the PL line. Depending
on the potential difference across the device (V_TB) set by the voltage values at PL and
BL/BLB, the OxRAM devices are either SET or RESET. For node QB (as shown in
Fig. 2.6b), the polarity of V_TB stays negative (with peak amplitude −1.6 V) throughout
the triangular single-cycle pulse applied at PL (as BLB = 0 V). The Ox2 device switches
from LRS → HRS, resulting in negligible current through it. As a result, QB stabilizes
at 0 V (logic ‘0’) and transistor M1 is turned off. Figure 2.6c, d shows the resistive
switching at Ox1 and Ox2. Due to the modulation of V_TB across Ox1, the device
switches twice in the first write cycle because it starts from
an initial LRS state. Note that this double switching in the OxRAM
device is a one-time phenomenon and is visible only during the first write
cycle unless the devices are re-initialized. Meanwhile, the potential drop
across Ox2 is negative for the entire write cycle. A similar phenomenon is evident
when writing data ‘0’ to the bitcell. Table 2.1 compares the resistive

Fig. 2.6 Applied PL, BL and BLB signals during Write logic ‘1’ operation for a Node Q, b Node
QB. c Ox1 switching during RESET and SET regions (inset: switching activity during the first
cycle), and d Ox2 switching during RESET region [25]

Table 2.1 Absolute programming times for programming the OxRAM devices used in 4T-2R
NV-SRAM bitcell [25]

              Programming time during           Programming time during
              NV-SRAM Write ‘0’ (ns)            NV-SRAM Write ‘1’ (ns)
Peak BL (V)   Ox1 (RESET)    Ox2 (SET)          Ox1 (SET)    Ox2 (RESET)
1.3           316            878                875          387
1.2           326            897                897          357
1.1           326            897                917          384
1             386            970                936          316

switching parameters for data write ‘0’ and ‘1’. For the proposed methodology, the
device RESETs in 357 ns and SETs in 168 ns (logic ‘1’ write).
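The way a single triangular PL pulse produces both polarities can be checked with a simple waveform sketch. The code below is a rough illustration under stated assumptions, not the simulated waveforms of [25]: PL is a symmetric 1 µs triangle peaking at 1.6 V, BL is assumed to ramp linearly to 1.2 V over the first half-cycle and then hold (the exact BL shape is an assumption), BLB stays at 0 V, and the internal nodes are taken to follow BL/BLB exactly.

```python
T_WRITE_NS = 1000.0   # write-cycle pulse width (1 us, from the text)
V_PL_PEAK = 1.6       # peak of the triangular PL pulse (V)
V_BL_PEAK = 1.2       # BL level for writing '1' (V); BLB stays at 0 V


def pl(t_ns):
    """Symmetric triangular pulse: rises for half the cycle, falls for the rest."""
    half = T_WRITE_NS / 2
    if t_ns <= half:
        return V_PL_PEAK * t_ns / half
    return V_PL_PEAK * (T_WRITE_NS - t_ns) / half


def bl(t_ns):
    """Assumed BL shape: linear ramp over the first half-cycle, then held."""
    half = T_WRITE_NS / 2
    return V_BL_PEAK * min(t_ns, half) / half


def v_tb_ox1(t_ns):
    """V_TB(Ox1) = V_BL - V_PL."""
    return bl(t_ns) - pl(t_ns)


def v_tb_ox2(t_ns):
    """V_TB(Ox2) = V_BLB - V_PL, with BLB held at 0 V."""
    return 0.0 - pl(t_ns)


times = [float(t) for t in range(0, int(T_WRITE_NS) + 1)]
ox1 = [v_tb_ox1(t) for t in times]
ox2 = [v_tb_ox2(t) for t in times]

# First instant at which V_TB(Ox1) turns positive (RESET region -> SET region).
t_cross = next(t for t, v in zip(times, ox1) if v > 1e-9)
print(f"V_TB(Ox1) changes sign near t = {t_cross:.0f} ns")
print(f"V_TB(Ox2) stays non-positive: {all(v <= 0.0 for v in ox2)}")
```

Under these assumed shapes, Ox1 first sees a negative V_TB and then a positive one within the same cycle, while Ox2 only ever sees a negative V_TB, which is the qualitative behaviour described for the Write ‘1’ operation.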
Impact of PL and BL Signals on Single-Cycle Programming: For
single-cycle operation, the amplitude, rise and fall times and pulse width of
the control and data signals PL and BL/BLB are the key parameters that determine
the operability of the NV-SRAM bitcell. For the OxRAM device used in the design,
the pulse width of the write cycle is taken as 1 µs. Programmed-resistance-state
based storage in the proposed 4T-2R NV-SRAM depends on the magnitude and
polarity of V_TB. The impact of the peak amplitude of BL (keeping PL fixed at 1.6 V)
on OxRAM device switching is shown in Fig. 2.7a, b. The potential drop across
the OxRAM devices (V_TB) is affected as the slope of the data signal BL is varied.
As the maximum amplitude (V_datamax) is increased, V_TB(Ox1) decreases, which
signifies that a weaker programming condition is applied to the device. This results
in programming of the OxRAM devices Ox1 and Ox2 to different SET and RESET
resistance states.
The programming of the OxRAM devices also depends on the peak voltage of PL, as
shown in Fig. 2.7c, d. Since the previous programming state of an OxRAM device
governs its subsequent programming conditions (specifically a RESET
state going to a subsequent SET state), the OxRAM devices switch at varying
values of V_TB (Fig. 2.7c). It can be observed that the RESET switching times remain
constant. This results in the same initial condition for Ox2 (Fig. 2.7d). From this figure it
can be observed that the SET switching time of the OxRAM increases with the amplitude
of PL. This is because more time is needed to build up the desired V_TB across the
OxRAM terminals. It is observed from Fig. 2.7 that by modulating the slopes of the
PL and BL signals, the latency of the 4T-2R NV-SRAM bitcell can be tuned in the
single-cycle programming approach.
Furthermore, by varying the rise and fall times of the applied PL signal (i.e.
using an asymmetric triangular pulse), the programming time of the NV-SRAM
bitcell can be tuned. This is because the rise and fall times determine the rate at
which the potential drop across the device develops to program the OxRAM
devices to SET/RESET states. Figure 2.6a illustrates the modulation of the
SET and RESET regions of the OxRAM device for applied PL and BL/BLB signals.
For switching of the device from HRS → LRS, the state of the OxRAM is modulated
by varying the rise and fall times of the PL signal (Fig. 2.8a, b). Note that
the RESET operation of the device occurs during the rise time of PL (Fig. 2.6a).

Fig. 2.7 Effect on V_TB required for switching, due to change in peak amplitude of a, b BL
(1–1.3 V) keeping peak amplitude of PL = 1.6 V, and c, d peak amplitude of PL (1.5–1.8 V)
keeping peak amplitude of BL = 1 V [25]

With a reduction in rise time, the slope of PL increases. The V_TB required for the
transition from LRS → HRS is reached sooner, thus reducing the switching time of
the device. Correspondingly, the device achieves a SET state faster as the fall time
of PL is increased. Figure 2.8c, d shows the variation in the resistance values and
switching times of Ox1/Ox2 with change in the rise time of the asymmetric PL signal.
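The link between rise time and RESET latency follows from the geometry of a linear ramp. The sketch below assumes a linear rising edge and a fixed |V_TB| at which RESET begins; the 0.8 V threshold and the rise times are hypothetical illustration values, and real devices switch over a range of conditions rather than at a single voltage.

```python
def time_to_reach(v_target, v_peak, t_rise_ns):
    """Time on a linear rising edge at which the ramp first reaches v_target."""
    if not 0 < v_target <= v_peak:
        raise ValueError("target must lie on the rising edge")
    return t_rise_ns * v_target / v_peak


V_PL_PEAK = 1.6     # V, peak PL amplitude used in the text
V_RESET_MAG = 0.8   # V, hypothetical |V_TB| at which RESET starts (assumption)

# With BLB = 0 V, |V_TB(Ox2)| equals the instantaneous PL voltage.
for t_rise in (300.0, 500.0, 700.0):   # hypothetical rise times in ns
    t_sw = time_to_reach(V_RESET_MAG, V_PL_PEAK, t_rise)
    print(f"rise time {t_rise:.0f} ns -> RESET condition reached after {t_sw:.0f} ns")
```

A shorter rise time reaches the same V_TB earlier, which is the trend described above for the LRS → HRS transition.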
An advantage of the single-cycle programming scheme over two-cycle programming
is the lower energy required during the write operation (≈80 fJ for HRS1-HRS2,
compared to 1.8 pJ for the LRS-HRS scheme [25]). The low energy is due to the fact
that the OxRAM devices stay in the RESET region for 60% of the total programming
time, during which only a small current (∼nA) flows through the device.
Furthermore, the programming time of the single-cycle scheme is half that of
the two-cycle scheme, making the single-cycle programming scheme an
energy- and latency-efficient approach.
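As a quick check of the quoted figures, the ratios can be computed directly. The snippet only restates numbers given in the text (≈80 fJ versus 1.8 pJ, and 1 µs versus 2 µs); the variable names are illustrative.

```python
E_SINGLE_CYCLE_FJ = 80.0      # ~80 fJ, single-cycle write energy quoted above
E_TWO_CYCLE_FJ = 1800.0       # 1.8 pJ, two-cycle LRS-HRS write energy
T_SINGLE_CYCLE_NS = 1000.0    # 1 us write cycle
T_TWO_CYCLE_NS = 2000.0       # 2 us write cycle

energy_ratio = E_TWO_CYCLE_FJ / E_SINGLE_CYCLE_FJ
time_ratio = T_TWO_CYCLE_NS / T_SINGLE_CYCLE_NS
print(f"write energy reduced by about {energy_ratio:.1f}x")
print(f"programming time reduced by {time_ratio:.0f}x")
```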

2.3.1.3 Read Operation for Two-Cycle and Single-Cycle Programming Schemes

The approach to reading the programmed bitcell is the same for both two-cycle and
single-cycle programming. To read the cell, the bitlines are precharged to Vdd/2, which

Fig. 2.8 Change in V_TB at switching instant, for a Ox1 and b Ox2 with change in the duration
of rising edge of PL voltage pulse. Change in the c switching time and d resistance of OxRAM
devices (after successful Write operation) by modulating the duration of rising edge of the pulse
applied at PL, keeping peak amplitude of BL signal at 1 V [25]

corresponds to the state ‘a’ in Fig. 2.5. Following that, WL is asserted and a read
voltage is applied to PL (state ‘b’). Current flows through Ox1 and Ox2 depending on
the resistance state to which each is programmed. An OxRAM device programmed
to a higher resistance value allows less current to flow through it than an
OxRAM device programmed to a low resistance state. The current through the device
charges or discharges the internal node and, in turn, pulls the BL/BLB lines up or down.
This approach is similar to the read in a conventional SRAM cell.
The sense amplifier used to differentiate the data written in the bitcell in such a case can
be a voltage-controlled sense amplifier (VCLA). Another approach to reading the bitcell is
to use a read voltage to capture the current through the device. In this case, a
current-controlled sense amplifier (CCSA) is used. In this scheme, a read voltage (V_read) is
applied to PL and a current corresponding to the resistive state flows through the
device. Since WL is asserted, the current flows through the BL/BLB lines and is
converted to voltage levels by the sense amplifier, enabling data read from the bitcell.
The advantage of such a read scheme is that no precharge circuit is needed
for the bitlines. This reduces the area overhead of the overall NV-SRAM
array.
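The current-based read can be pictured with a first-order, purely ohmic estimate. The sketch below is an assumption-laden illustration: the read voltage of 0.2 V is hypothetical, transistor on-resistance and bitline loading are ignored, and the mapping "Ox1 in the lower-resistance state means logic ‘1’" follows the Write ‘1’ column of Table 2.1 (Ox1 SET, Ox2 RESET). The resistance values are those quoted for the LRS-HRS scheme.

```python
R_LRS_OHM = 268e3    # SET-state resistance quoted for the LRS-HRS scheme
R_HRS_OHM = 2.04e6   # RESET-state resistance
V_READ = 0.2         # V, hypothetical read voltage on PL (assumption)


def read_current_a(v_read, r_device_ohm):
    """Ohmic estimate of the current through one OxRAM device during read."""
    return v_read / r_device_ohm


def sense_bit(r_ox1_ohm, r_ox2_ohm, v_read=V_READ):
    """Compare the two branch currents, as a current sense amplifier would."""
    i_ox1 = read_current_a(v_read, r_ox1_ohm)
    i_ox2 = read_current_a(v_read, r_ox2_ohm)
    return 1 if i_ox1 > i_ox2 else 0


for stored, (r1, r2) in {1: (R_LRS_OHM, R_HRS_OHM), 0: (R_HRS_OHM, R_LRS_OHM)}.items():
    i1 = read_current_a(V_READ, r1) * 1e9
    i2 = read_current_a(V_READ, r2) * 1e9
    print(f"stored {stored}: I(Ox1) = {i1:.0f} nA, I(Ox2) = {i2:.0f} nA, "
          f"sensed {sense_bit(r1, r2)}")
```

The read voltage must stay well below the programming V_TB values so that the read does not disturb the stored resistance states.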

2.3.2 Stability Analysis

Stability of a memory cell is an important aspect since it quantifies
the amount of noise that the cell can tolerate without flipping the logic state stored in
it. If the noise crosses the threshold value, the stability of the cell is compromised
due to unwanted fluctuations at the output node. This degradation further leads to
read disturbs and write failures. The key aspects of cell stability are defined by two
approaches: butterfly curves for read, write and hold [58], and the N-curve [59]. The
metrics obtained from these approaches enable designers to make a more robust
and stable cell [59]. Using the voltage and current information from these stability
approaches, a designer can understand the implications of stability metrics on the
intrinsic and extrinsic properties of the bitcell.

2.3.2.1 Static Noise Margin (SNM)

Conventionally, the stability of an SRAM bitcell is defined using SNM [58]. SNM is the
maximum value of DC noise voltage Vn that can be tolerated by the memory bitcell
without changing the logic state. For the 4T-2R NV-SRAM bitcell (at the 90 nm technology
node), the hold, read and write SNM are 0.3 V, 0.13 V and 0.42 V respectively.
For an SRAM bitcell with cell ratio (CR) 2 and pull-up ratio (PR) 1, the hold, read
and write SNM values are 0.5 V, 0.15 V and 0.5 V respectively [57]. Figure 2.9a–c
shows the effect of Vdd scaling on hold, read and write SNM for 4T-2R NV-SRAM.
Figure 2.9d–f shows the hold, read and write SNM curves for 4T-2R NV-SRAM with
pull-down transistor (M1 and M2) widths in the range 200 nm–2 µm. The width of M3/M4
is kept constant at 180 nm. It is observed that read SNM is a strong function of CR.
For lower CR values, the Read operation fails; hence, for a reliable Read operation, CR
needs to be greater than or equal to 2.2. Furthermore, a pull-down transistor
(M1 and M2) width of 200 nm (CR ≈ 1.11) is desirable for a successful Write operation;
however, because the Read operation is otherwise destructive, the bitcell needs to be
designed with CR ≥ 2.2 [57].
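The cell-ratio constraint can be translated into a minimum pull-down width. The sketch below assumes that CR is simply the pull-down width divided by the access-transistor width, i.e. equal channel lengths, which is consistent with the quoted pair 200 nm / CR ≈ 1.11 for 180 nm access transistors; the function names are illustrative.

```python
W_ACCESS_NM = 180.0  # width of access transistors M3/M4 (from the text)
CR_MIN_READ = 2.2    # minimum cell ratio for a reliable Read (from the text)


def cell_ratio(w_pull_down_nm, w_access_nm=W_ACCESS_NM):
    """Cell ratio taken as pull-down width over access width (equal lengths assumed)."""
    return w_pull_down_nm / w_access_nm


def min_pull_down_width_nm(cr_min=CR_MIN_READ, w_access_nm=W_ACCESS_NM):
    """Smallest pull-down width that satisfies the read-stability cell ratio."""
    return cr_min * w_access_nm


print(f"CR at 200 nm pull-down width: {cell_ratio(200.0):.2f}")
print(f"pull-down width needed for CR >= {CR_MIN_READ}: "
      f"{min_pull_down_width_nm():.0f} nm")
```

Under this assumption, a reliable Read calls for pull-down transistors of roughly 400 nm or wider.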

2.3.2.2 N-curve

SNM considers only the voltage metrics of the SRAM/NV-SRAM cell
when analyzing bitcell stability. The N-curve method [59], which considers both voltage
and current metrics, gives the following stability metrics: SVNM (static voltage
noise margin), SINM (static current noise margin), WTV (write-trip voltage) and
WTI (write-trip current). The read stability criterion is defined using SVNM and SINM.
A small SVNM combined with a large SINM (or vice versa) results in a stable cell
because the Vn required to disturb the cell is large [59]. Table 2.2 summarizes the N-curve
parameters calculated for 6T SRAM and 4T-2R NV-SRAM [57]. N-curve characteristics
are plotted by modulating the pull-down transistor width (i.e. by changing CR) of the
NV-SRAM cell and the Vdd amplitude (shown in Fig. 2.9g–i). It is observed

Fig. 2.9 Simulated 4T-2R NV-SRAM bitcell—a, d Hold SNM b, e Read SNM and c, f Write
SNM, for different Vdd values and different pull-down transistor widths respectively. N-curves for
6T SRAM and 4T-2R NV-SRAM are shown in g and impact of h different Vdd and i pull-down
transistor widths on N-curves of 4T-2R NV-SRAM bitcell [57]

Table 2.2 N-curve parameters for 6T SRAM and 4T-2R NV-SRAM bitcell [57]

Parameters        SVNM (mV)   SINM (µA)   WTV (mV)   WTI (µA)
6T SRAM           489.9       113.1       745.8      65.8
4T-2R NV-SRAM     431.7       134.7       709.56     66.4

that with increasing Vdd and pull-down transistor size, there is improvement in SINM,
WTI and WTV, while SVNM remains almost constant.
The 4T-2R NV-SRAM bitcell offers several advantages over
other NV-SRAM designs proposed in literature, such as (i) real-time nonvolatility,
(ii) unconventional transistor sizing, (iii) low area footprint and (iv) low-power
operation. For quantitative comparison, Table 2.3 compares
the 4T-2R NV-SRAM bitcell with other NV-SRAM implementations
proposed in literature so far.

Table 2.3 Comparison of different 4T-2R NV-SRAM bitcells and state-of-the-art 6T SRAM bitcell
[25]

Parameters          4T-2MTJ [27]                 4T-2R [60]     4T-2R [24]                4T-2R [25]     6T SRAM [61, 62]
                                                                (LRS-HRS / HRS1-HRS2)
Volatility          NV                           NV             NV                        NV             V
NV device*          MTJ                          STI-OxRAM      OxRAM                     OxRAM          –
Tech. node**        32 nm (Sim.); 90 nm (Fab.)   40 nm (Fab.)   90 nm (Sim.)              90 nm (Sim.)   10 nm (Fab.)
Prog.               Two step                     Two step       Two step                  One step       One step
Vdd (V)             1                            2.8            1.6 / 1.2                 1.6            0.6
Write time          25 ns                        5 µs           2 µs / 2 µs               1 µs           0.6 ns
Pull-down
transistor size     3 µm                         –              640 nm / 240 nm           200 nm         70 nm
R_LRS (Ω)           1 k                          20–400 k       268 k / 0.68 M            264 k          –
R_HRS (Ω)           2 k                          ≈2 M           2.04 M / 2.04 M           2.04 M         –
Switching current   400 µA                       100 µA         2.8 µA / 456 nA           2.7 µA         –
SNM (mV)            340                          258            250                       300            200

*NV Nonvolatile, V Volatile;
**Fab. Fabricated, Sim. Simulated

2.4 Real-Time NV-FF Based on 4T-2R NV-SRAM Circuit

Figure 2.10 shows the schematic of the proposed real-time NV-FF. The circuit uses
four OxRAM devices to store the data in real time as it is transferred from D to
Q. The major advantages of this NV-FF are:
• The circuit is implemented in a smaller area than both the conventional
CMOS-based FF and the off-loading-based NV-FF.
• The circuit offers zero leakage current during the off-state of the NV-FF.
• The circuit handles power glitches during the active/normal operating mode that
may otherwise corrupt the data.
• The circuit is easy to design: OxRAM devices replace the PMOS transistors of the
conventional CMOS-based FF and NV-FF. It is therefore cost-effective.
The proposed NV-FF design consists of two modules: (1) a master block and (2) a slave
block. Unlike a traditional NV-FF, which has three operating modes (active, store and
restore), the proposed NV-FF has only two operating modes: the active/normal
mode (which also stores the data to the nonvolatile device) and the restore mode.

Fig. 2.10 Schematic of the real-time NV-FF. It is similar to the conventional CMOS based FF in
terms of modules constituting it—Master Block and Slave Block

2.4.1 Operating Modes of the Real-Time NV-FF

1. Active/Normal Mode: In this mode, when CLK = 0 is asserted and data D = 0
latches on to the input, the NMOS transistor M1 in the master block turns on.
This discharges the load capacitor C_L1 of the master block to ‘0’.
To store the data in the OxRAM device, PL is taken as a two-cycle signal that
transitions from ‘1’ to ‘0’ after a pulse of duration equal to the switching time
of the OxRAM device. As the internal capacitor (C_L1) discharges, a low potential
appears at the bottom electrode of the OxRAM device connected to the M1
transistor. When PL = 1, this OxRAM is programmed to the RESET state (HRS)
since V_TB < 0. PL = 1 is followed by PL = 0, which does not disturb the
state of the OxRAM since V_TB = 0. A ‘0’ at the output node of the 1T-1R in the
feed-forward path of the master block switches off the NMOS transistor M2 in
the master block’s feedback path. The OxRAM device connected to M2 slowly
charges the internal node capacitor C_L2 to logic ‘1’ and holds the state. The
programming of this OxRAM device in the feedback path is similar to that in the
feed-forward path. At PL = 1, the OxRAM in the feedback path is not programmed,
as V_TB ≈ 0. When PL goes to ‘0’, V_TB > 0 because the internal load capacitor
C_L2 has charged. This programs the OxRAM to the SET state. Note that the two
OxRAM devices in the block are always programmed to opposite states
whenever data is applied. The master block is followed by an inverter whose
output is fed as an input to the slave block. Therefore, the internal node Qm
holds the inverse of the input D (Qm = 1 when D = 0). When
CLK = 1, D is isolated from the master block and the output of the 1T-1R in
the master block’s feedback path is connected to the gate of the transistor M1.
Qm = 1 is applied to the gate of the transistor M3, turning it on. This discharges
the load capacitor C_L1 of the slave block to ‘0’, giving an output of ‘0’ at Q.
Q = 0 turns off the transistor M4 and therefore the OxRAM connected to it slowly
charges the internal node capacitor C_L2 to ‘1’. The OxRAM devices in the slave
block are programmed in the same way as those in the master block.
A point to note here is that no external control signal monitors or triggers
the data off-loading. This reduces the number of external connections
to the NV-FF, easing routing and pin/terminal congestion.
2. Power-down Mode: During the power-down mode, all the signals are pulled down
to zero and the FF goes to a standby mode. Since the nonvolatile devices store
the data as their resistive states, it is not lost and can be restored when the
NV-FF is powered up again.
3. Restore Mode: A CLK = 0 and PL = 1 are asserted when the NV-FF block is
turned on. This allows a current to flow through OxRAM devices (connected to
M3 and M4) depending on the resistive state to which they are programmed. The
OxRAM device connected to M4 (programmed to a SET state) charges the gate
of transistor M3 to a logic ‘1’ turning it on. This leads to the discharge of internal
node capacitor C L1 in the slave block to logic ‘0’ restoring the data at the output
Q of the NV-FF.
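
The three modes above can be summarized in a small behavioral sketch. It is not a circuit simulation and not the authors' model: the class and method names are hypothetical, the Qm inversion between master and slave is abstracted away, and the D = 1 case is assumed to mirror the D = 0 case described in the text (feed-forward and feedback devices swap states).

```python
from dataclasses import dataclass

HRS, LRS = "HRS", "LRS"  # RESET and SET resistive states


@dataclass
class Latch4T2R:
    """One 4T-2R block: a feed-forward and a feedback OxRAM (hypothetical model)."""
    feed_forward: str = HRS
    feedback: str = LRS

    def write(self, bit: int) -> None:
        # From the text: a '0' RESETs the feed-forward device and SETs the
        # feedback device. The mirrored case for '1' is an assumption.
        if bit == 0:
            self.feed_forward, self.feedback = HRS, LRS
        else:
            self.feed_forward, self.feedback = LRS, HRS

    def read(self) -> int:
        # A feedback device in LRS turns the feed-forward transistor on,
        # which discharges the output node to '0'.
        return 0 if self.feedback == LRS else 1


class RealTimeNVFF:
    def __init__(self) -> None:
        self.master = Latch4T2R()
        self.slave = Latch4T2R()
        self.q = None  # volatile output; None models the unpowered state

    def clock_cycle(self, d: int) -> int:
        self.master.write(d)                  # CLK = 0: program R1/R2
        self.slave.write(self.master.read())  # CLK = 1: program R3/R4
        self.q = self.slave.read()
        return self.q

    def power_down(self) -> None:
        self.q = None  # node voltages are lost, resistive states remain

    def restore(self) -> int:
        # Only the slave-block devices take part in the restore.
        self.q = self.slave.read()
        return self.q
```

Calling `clock_cycle(0)`, then `power_down()`, then `restore()` returns 0 again, which mirrors the store-while-operating behavior described above: no separate store step is needed before power is removed.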

We can observe here that data off-loading occurs in the normal mode, but only
the OxRAM devices in the slave block participate in restoring the data. In addition,
this NV-FF is slower than a conventional NV-FF, since the total time to transfer
the data to the output (T_D−Q) is approximately twice the programming
time of the OxRAM device. The total time needed to transfer and store the data in
real-time is:
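\[ T_{D-Q} = T_{\text{master}} + T_{\text{slave}} \tag{2.1} \]

\begin{align}
T_{D-Q} &= \max\left(T_{\text{feed-forward OxRAM}},\, T_{\text{feedback OxRAM}}\right) \tag{2.2} \\
        &\quad + \max\left(T_{\text{feed-forward OxRAM}},\, T_{\text{feedback OxRAM}}\right) \tag{2.3}
\end{align}

Say $T_{\text{feed-forward OxRAM}} = T_{\text{feedback OxRAM}} = T$, then

\[ T_{D-Q} = 2T \tag{2.4} \]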

Therefore, the performance of the NV-FF in this case heavily depends on the pro-
gramming time of the OxRAM device. As the technology improves, faster OxRAM
devices are being proposed. Therefore, such a design proves to be beneficial in terms
of area and performance.
Figure 2.11 shows the timing diagram of the 4T-2R-based NV-FF for real-time
data storage. The transistors used in this simulation are from the 90 nm technology
node and the OxRAM model is the same as described in [56]. The FF operates
at 1.6 V. For the device model used in the simulation T_RESET ≥ T_SET [25]; thus,
T_store = 714 ns and T_restore = 2 ns. Since the write current of the OxRAM is small
(2.3 µA for programming the OxRAM to LRS and 364 nA for programming it to
HRS), the transistors used in the latch can be kept at minimum standard sizing
without any additional parasitics.
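
As a quick numerical check of Eqs. 2.1 to 2.4, the sketch below evaluates the transfer time for equal device programming times. The function name is hypothetical, and the per-device time of 357 ns is an assumption obtained by halving the quoted T_store of 714 ns; it is not stated as a device parameter in the text.

```python
def transfer_time(t_ff_master, t_fb_master, t_ff_slave, t_fb_slave):
    """T_D-Q per Eqs. 2.1-2.3: each block waits for its slower OxRAM."""
    t_master = max(t_ff_master, t_fb_master)
    t_slave = max(t_ff_slave, t_fb_slave)
    return t_master + t_slave


T_NS = 357  # assumed per-device programming time in ns (714 ns / 2)
t_dq = transfer_time(T_NS, T_NS, T_NS, T_NS)
print(t_dq)  # 714
```

With unequal SET and RESET times, the slower of the two devices in each block sets that block's contribution, which is why the RESET time dominates when T_RESET ≥ T_SET.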

Fig. 2.11 Timing diagram of the 4T-2R-based NV-FF. When CLK = 0, the OxRAM in master
block gets programmed and when CLK = 1, OxRAM in slave block gets programmed. R1 (feed-
forward) and R2 (feedback) are master block OxRAM, R3 (feed-forward) and R4 (feedback) are
slave block OxRAM

2.4.2 Modified NV-FF Design for Improved System Performance

Due to the limitations of the NV-FF design proposed in the previous section, a
modified NV-FF design is presented here. The proposed NV-FF, as shown in Fig. 2.12,
has three different modes of operation: (1) active or normal mode, (2) off-loading
or store mode and (3) restore mode. The NV-FF consists of a volatile master stage
and a single OxRAM device in the slave stage. This device stores the off-loaded
data just before power-down mode is activated. A small area overhead is required

Fig. 2.12 Schematic of the proposed modified NV-FF depicting the 3 major operating blocks

for the proposed NV-FF (6 extra transistors in addition to the 22 transistors needed
for conventional CMOS based FFs). Only the slave latch is employed to write/read
the OxRAM device during the off-loading/restore mode without the need for any
sensing or dedicated write driver block. STR (store) and RSTR (restore) signals
are asserted such that only one signal is activated during store/restore operation.
Figure 2.13 shows the schematic of the off-loading block of the proposed NV-FF. We
can see that the block is essentially made up of three separate modules.
1. Nonvolatile Block: This block stores the data that is to be off-loaded from the
output node Q. When Q = 0, the OxRAM in this block is programmed to HRS
and when Q = 1, the OxRAM in this block is programmed to LRS.
2. Control Block: This block controls the operation being performed by the off-
loading section. It consists of a simple OR gate with two inputs: STR and RSTR.
Table 2.4 shows the operation performed by the off-loading block according to
the input combinations of the STR and RSTR signals. It is to be noted that STR
and RSTR can never be ‘1’ at the same time.
3. Data Generation Block: This block along with the control block off-loads the
data or restores the data to the output node Q when a STR or RSTR signal is
applied. The block is mainly responsible for the following two tasks:
a. Provide write data voltage (V_WR) when data is being off-loaded.
b. Provide read data voltage (V_RD) when the data is being read to restore the
data.

Fig. 2.13 The off-loading block showing the three sub-blocks. The data is off-loaded during
controlled power-down in a single OxRAM device. The control block is a simple OR gate controlled
by two control signals STR and RSTR. The data generation block is used to program the device
during data off-loading and to provide the supply during the restore operation

Table 2.4 Operations performed by the control block during off-loading of data to the nonvolatile block
STR | RSTR | Operation performed
0 | 0 | Normal operation of flip-flop
0 | 1 | Restore the flip-flop to the stored state
1 | 0 | Store the data at the output node Q to OxRAM (Data off-loading state)
1 | 1 | Invalid state

For the proposed circuit V_WR is taken as 1.6 V and V_RD is taken as 0.4 V. The
read voltage has to be chosen such that the internal state of the OxRAM is not
disturbed during data read.
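
The decoding in Table 2.4, together with the voltage selected by the data generation block, can be sketched as follows. The function name and return format are hypothetical; only the OR-gate enable, the four STR/RSTR combinations and the 1.6 V and 0.4 V levels come from the text.

```python
V_WR = 1.6  # write voltage in volts (from the text)
V_RD = 0.4  # read voltage in volts (from the text)


def control_block(str_sig: int, rstr_sig: int):
    """Return (enable, operation, supply_voltage) for one STR/RSTR pair."""
    if str_sig and rstr_sig:
        # Table 2.4 lists STR = RSTR = 1 as an invalid state.
        raise ValueError("STR and RSTR must never be '1' at the same time")
    enable = str_sig | rstr_sig  # the OR gate of the control block
    if str_sig:
        return enable, "store", V_WR
    if rstr_sig:
        return enable, "restore", V_RD
    # Normal operation: the off-loading block is not driven.
    return enable, "normal", None
```

Raising an error for the invalid combination is a modeling choice made here; the circuit itself simply relies on the two signals never being asserted together.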

2.4.3 Operating Modes of the Proposed NV-FF

1. Active/Normal Mode: During active or normal mode, both the STR and RSTR
signals are held at logic ‘0’ and the terminals of the OxRAM device are grounded
(V_TB = 0). Therefore the OxRAM device does not participate in the normal FF
operation. When CLK = 0, the master stage latches the input from data line D.
When CLK = 1, the feedback path in the master stage holds the last sampled
value and the data is transferred to the slave stage. The operation continues till
a power-down/sleep-mode is activated.
2. Store Mode: In this mode the control signals STR = 1 and RSTR = 0 are
asserted to off-load the data to the OxRAM device. This produces a logic ‘1’
at the output of the control block, thereby switching on the two transistors in
the nonvolatile block and the data generation block respectively (see Fig. 2.14).
Since RSTR = 0, both multiplexers in the data generation block select their
first input. One multiplexer therefore selects V_WR, which provides the
data write voltage of 1.6 V to the power supply of the inverter, while the other
provides the data to be off-loaded at the input of the inverter.
The output of the inverter is thus the opposite of the data value being stored,
which ensures that the voltage applied to the OxRAM has the correct polarity.
When Q = 0, TE is at a lower potential and BE is at a higher potential
(V_TB < 0); therefore the OxRAM is programmed to HRS. Similarly, when Q = 1,
TE is at a higher potential than BE (V_TB > 0) and therefore the OxRAM is
programmed to LRS. After the data is written to
the OxRAM block, the power-down signal is asserted to switch off the NV-FF.
Note that the FF has to wait until the OxRAM is programmed to avoid
data corruption during off-loading.
3. Power-Down/Sleep-Mode: In sleep-mode, all the data and the control signals
are pulled down. Since the OxRAM device stores the data as its resistive state,
the data off-loaded to it remains stored. As the system is fully switched off, the
leakage current of this block is negligible.
4. Restore Mode: During the restore mode, CLK = 0 and RSTR = 1 are asserted
thereby switching ON the transistors in the nonvolatile and the data generation

Fig. 2.14 Store operation in the proposed modified NV-FF. Green data shows the polarities and
data values at off-loading block circuit nodes when Q = 0 is to be stored in OxRAM. Blue data
shows the polarities and the values at off-loading block circuit nodes when Q = 1 is to be stored.
The red data shows the control block signals

blocks. The data block provides a read voltage (V_RD = 0.4 V) to the OxRAM,
which restores the data in the slave latch. When the OxRAM is in the SET state,
the internal node Qm charges to a logic ‘1’. The charging of node Qm is
supported by the inverters in the feedback circuit. Due to the presence of an
inverter between the data store/restore node Qm and the output node Q, the
original data is restored at the output of the FF. Similar steps are followed when
logic ‘1’ is restored from the OxRAM. Figure 2.15 shows the restore operation in
the proposed FF circuit.
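
The polarity rule used in the store mode, and the round trip through the restore mode, can be illustrated with a minimal sketch. The electrode wiring is an assumption inferred from the stated polarities (TE following Q, BE following the inverter output), the function names are hypothetical, and the restore step abstracts the read current and the feedback inverters into a direct state-to-logic mapping (LRS for ‘1’, HRS for ‘0’).

```python
V_WR = 1.6  # write voltage in volts (from the text)


def store(q: int) -> str:
    """Program the OxRAM from output Q and return its resistive state."""
    v_te = V_WR if q else 0.0  # assumed: TE follows Q
    v_be = 0.0 if q else V_WR  # assumed: BE follows the inverter output
    v_tb = v_te - v_be
    # V_TB > 0 SETs the device (LRS); V_TB < 0 RESETs it (HRS).
    return "LRS" if v_tb > 0 else "HRS"


def restore(state: str) -> int:
    """Recover the logic value at the output from the stored state."""
    return 1 if state == "LRS" else 0


for q in (0, 1):
    assert restore(store(q)) == q
```

The loop at the end checks that a value stored before power-down is the value recovered afterwards, which is the property the single-OxRAM off-loading block is meant to provide.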

Fig. 2.15 Restore operation in the proposed modified NV-FF. Here V_RD provides the read voltage
that is needed to read the state stored in the OxRAM. The I_read current obtained from the OxRAM
charges the load capacitance at the output to a logic value ‘1’ or ‘0’ depending on the state of the
OxRAM (LRS or HRS)

Fig. 2.16 Timing diagram showing different operating modes of the modified NV-FF. The off-
loading is controlled by the control signals STR and RSTR

Table 2.5 Comparison of various NV-FF designs proposed in literature and the proposed NV-FF designs
Parameters | [3] | [31] | [32] | [51] | [54] | 4T-2R NV-FF | NV-FF w/data off-load
NV Device | MTJ | OxRAM | STT/SHE | OxRAM | Ferroelectric capacitors | OxRAM | OxRAM
Simulated/fabricated | Sim. | Fab. | Fab. | Sim. | Fab. | Sim. | Sim.
Technology node (nm) | 22 | 65 | 150 | 180 | 130 | 90 | 90
Programming voltage (V) | 1.1 | 1 | 1.5 | 2 | 1.5 | 1.6 | 1.6
Store delay | − | 4 µs^a | 50 µs^b | − | 170 ns | 714 ns | 357 ns
Restore delay | − | 16 ns^a | 20 µs^b | − | 160 ns | 2 ns | 2 ns
Store energy | 0.57 pJ | 46.2 pJ | − | 735 fJ^c | 2.4 pJ | 186 pJ | 93 pJ
Restore energy | 58 fJ | 9.2 fJ | − | 735 fJ^c | 2.34 pJ | 0.4 pJ | 0.4 pJ
Sim. Simulated, Fab. Fabricated
a Per 1000 NV-FFs
b For 1000 instructions in per operational clock cycle of chip
c For 0.8 V

The timing diagram of the simulated NV-FF block is shown in Fig. 2.16 with
different operating modes. The simulations were based on the OxRAM model
given in [56] and the 90 nm technology node. The flip-flop operates at 1.6 V. Data is
stored in a minimum T_store = 15 ns and restored in a minimum T_restore = 2 ns. The
NV-FF has a backup energy of 3.08 pJ and a restore energy of 0.4 pJ. Since the write
current of the OxRAM is small (2.3 µA for programming the OxRAM to LRS and 364
nA for programming it to HRS), the transistors used in the latch can be
kept at minimum standard sizing without any additional parasitics.
Table 2.5 compares the proposed NV-FFs with other NV-FF designs in the
literature. The NV-FFs considered in this table off-load data to devices ranging
from OxRAM and ferroelectric capacitors to MTJ based devices. While conventional
NV-FFs rely on serial [36, 40, 41, 55, 63] or two-phase writing [38, 39, 42] during data
off-loading, the proposed NV-FF uses a single NV device with parallel
data writing. This drastically reduces the access time of the NV-FF and the overall
energy of the circuit.
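
To make the comparison between the two proposed designs concrete, the short sketch below computes the store-delay and store-energy ratios from the Table 2.5 entries. The dictionary layout and variable names are illustrative only; the numbers are copied from the table.

```python
# Store delay (ns) and store energy (pJ) of the two proposed designs,
# copied from Table 2.5.
designs = {
    "4T-2R NV-FF": {"store_ns": 714, "store_pj": 186},
    "NV-FF w/ data off-load": {"store_ns": 357, "store_pj": 93},
}

base = designs["4T-2R NV-FF"]
mod = designs["NV-FF w/ data off-load"]
delay_ratio = base["store_ns"] / mod["store_ns"]
energy_ratio = base["store_pj"] / mod["store_pj"]
print(delay_ratio, energy_ratio)  # 2.0 2.0
```

Both ratios come out as a factor of two, in line with the modified design writing one OxRAM device where the 4T-2R NV-FF programs its master and slave blocks in sequence.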

2.5 Conclusion

In this chapter, we have presented a real-time 4T-2R NV-SRAM bitcell using HfOx
based OxRAM devices. We have explained its different operational modes (i.e. Write
mode and Read mode) along with multiple programming approaches (Two-cycle and
Single-cycle programming schemes). Since stability of NV-SRAM bitcell has been a
concern, we presented a detailed analysis summarizing the impact of Vdd scaling and
transistor down-scaling on the stability metrics (SNM and N-curve). We observed
that the 4T-2R NV-SRAM allows transistor down-scaling and that its
lower switching current enables low-power circuit design. We further extended the
scope of the 4T-2R NV-SRAM bitcell by proposing a real-time NV-FF based on it. We
also discussed the shortcomings of having the OxRAM device actively participate in
the normal operation of the NV-FF and proposed a modified NV-FF design to mitigate
these issues. Although the major challenge in the design of NV-SRAM and
NV-FF is handling abrupt power glitches, active participation of the OxRAM
device slows down the overall circuit. We believe that advances in materials
science and engineering will address this challenge. Developmental works
such as [10, 64–69] suggest that progress in this area has started picking
up. Thus, in days to come, OxRAM based real-time designs will not only be area
and power efficient but will also show better performance in terms of latency and energy.

References

1. J. Abouei, J.D. Brown, K.N. Plataniotis, S. Pasupathy, Energy efficiency and reliability in
wireless biomedical implant systems. IEEE Trans. Inf. Technol. Biomed. 15(3), 456–466 (2011)
2. A.C.K. Chan, S. Okochi, K. Higuchi, T. Nakamura, H. Kitamura, J. Kimura, T. Fujita, K.
Maenaka, Low power wireless sensor node for human centered transportation system, in 2012
IEEE International Conference on Systems, Man, and Cybernetics (SMC) (IEEE, 2012), pp.
1542–1545
3. A.S. Iyengar, S. Ghosh, J.-W. Jang, MTJ-based state retentive flip-flop with enhanced-scan
capability to sustain sudden power failure. IEEE Trans. Circuits Syst. I: Regul. Pap. 62(8),
2062–2068 (2015)
4. T. Lin, K.-S. Chong, B.-H. Gwee, J.S. Chang, Fine-grained power gating for leakage and
short-circuit power reduction by using asynchronous-logic, in IEEE International Symposium
on Circuits and Systems, 2009. ISCAS 2009 (IEEE, 2009), pp. 3162–3165
5. S. Onkaraiah, M. Reyboz, F. Clermidy, J.-M. Portal, M. Bocquet, C. Muller, C. Anghel, A.
Amara et al., Bipolar reram based non-volatile flip-flops for low-power architectures, in 2012
IEEE 10th International New Circuits and Systems Conference (NEWCAS) (IEEE, 2012), pp.
417–420
6. S.K. Thirumala, A. Raha, H. Jayakumar, K. Ma, V. Narayanan, V. Raghunathan, S.K. Gupta,
Dual mode ferroelectric transistor based non-volatile flip-flops for intermittently-powered sys-
tems, in Proceedings of the International Symposium on Low Power Electronics and Design
(ACM, 2018), p. 31
7. P.-F. Chiu, M.-F. Chang, W. Che-Wei, C.-H. Chuang, S.-S. Sheu, Y.-S. Chen, M.-J. Tsai, Low
store energy, low VDDmin, 8T2R nonvolatile latch and SRAM with vertical-stacked resistive
memory (memristor) devices for low power mobile applications. IEEE J. Solid-State Circuits
47(6), 1483–1496 (2012)
8. M. Ueki, K. Takeuchi, T. Yamamoto, A. Tanabe, N. Ikarashi, M. Saitoh, T. Nagumo, H. Suna-
mura, M. Narihiro, K. Uejima et al., Low-power embedded ReRAM technology for IoT appli-
cations, in 2015 Symposium on VLSI Technology (VLSI Technology) (IEEE, 2015), pp. T108–
T109
9. I.G. Baek, C.J. Park, H. Ju, D.J. Seong, H.S. Ahn, J.H. Kim, M.K. Yang, S.H. Song, E.M. Kim,
S.O. Park et al., Realization of vertical resistive memory (VRRAM) using cost effective 3D
process, in 2011 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2011), pp.
31–38
10. S. Yu, H.-Y. Chen, B. Gao, J. Kang, H.-S.P. Wong, HfOx-based vertical resistive switching
random access memory suitable for bit-cost-effective three-dimensional cross-point architecture.
ACS Nano 7(3), 2320–2325 (2013)
11. D. Ielmini, Resistive switching memories based on metal oxides: mechanisms, reliability and
scaling. Semicond. Sci. Technol. 31(6), 063002 (2016)
12. X. Dong, N.P. Jouppi, Y. Xie, A circuit-architecture co-optimization framework for evaluat-
ing emerging memory hierarchies, in 2013 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS) (IEEE, 2013), pp. 140–141
13. X. Xue, W. Jian, Y. Xie, Q. Dong, R. Yuan, Y. Lin, Novel RRAM programming technology
for instant-on and high-security FPGAs, in 2011 IEEE 9th International Conference on ASIC
(ASICON) (IEEE, 2011), pp. 291–294
14. P.-F. Chiu, M.-F. Chang, S.-S. Sheu, K.-F. Lin, P.-C. Chiang, C.-W. Wu, W.-P. Lin, C.-H. Lin,
C.-C. Hsu, F.T. Chen et al., A low store energy, low vddmin, nonvolatile 8T2R SRAM with 3d
stacked RRAM devices for low power mobile applications, in 2010 IEEE Symposium on VLSI
Circuits (VLSIC) (IEEE, 2010), pp. 229–230
15. Y. Zheng, P. Huang, H. Li, X. Liu, J. Kang, G. Du, Simulation of the RRAM based nonvolatile
SRAM cell, in 2014 12th IEEE International Conference on Solid-State and Integrated Circuit
Technology (ICSICT) (IEEE, 2014), pp. 1–3
16. S. Yamamoto, S. Sugahara, Nonvolatile static random access memory using magnetic tun-
nel junctions with current-induced magnetization switching architecture. Jpn. J. Appl. Phys.
48(4R), 043001 (2009)
17. A.M.S. Tosson, A. Neale, M. Anis, L. Wei, 8T1R: A novel low-power high-speed RRAM-
based non-volatile SRAM design, in 2016 International Great Lakes Symposium on VLSI
(IEEE, 2016), pp. 239–244
18. S.-S. Sheu, C.-C. Kuo, M.-F. Chang, P.-L. Tseng, L. Chih-Sheng, M.-C. Wang, C.-H. Lin, W.-P.
Lin, T.-K. Chien, S.-H. Lee et al., A reram integrated 7T2R non-volatile SRAM for normally-off
computing application, in 2013 IEEE Asian Solid-State Circuits Conference (A-SSCC) (IEEE,
2013), pp. 245–248
19. M. Takata, K. Nakayama, T. Izumi, T. Shinmura, J. Akita, A. Kitagawa, Nonvolatile SRAM
based on phase change, in 2006 21st IEEE Non-Volatile Semiconductor Memory Workshop,
NVSMW (IEEE, 2006), pp. 95–96
20. W. Wei, K. Namba, J. Han, F. Lombardi, Design of a nonvolatile 7T1R SRAM cell for instant-on
operation. IEEE Trans. Nanotechnol. 13(5), 905–916 (2014)
21. A. Lee, M.-F. Chang, C.-C. Lin, C.-F. Chen, M.-S. Ho, C.-C. Kuo, P.-L. Tseng, S.-S. Sheu,
T.-K. Ku, RRAM-based 7T1R nonvolatile SRAM with 2x reduction in store energy and 94x
reduction in restore energy for frequent-off instant-on applications, in 2015 Symposium on
VLSI Technology (VLSI Technology) (IEEE, 2015), pp. C76–C77
22. W. Wang, A. Gibby, Z. Wang, T. W. Chen, S. Fujita, P. Griffin, Y. Nishi, S. Wong, Nonvolatile
SRAM cell, in 2006 International Electron Devices Meeting, December 2006 (2006), pp. 1–4
23. K. Abe, Hierarchical nonvolatile memory with perpendicular magnetic tunnel junctions for
normally-off computing, in International conference on solid state devices and materials
(SSDM 2010) (Tokyo, Japan, 2010), p. 2010
24. S. Majumdar, S.K. Kingra, M. Suri, M. Tikyani, Hybrid CMOS-OxRAM based 4T-2R NVS-
RAM with efficient programming scheme, in 2016 16th Non-Volatile Memory Technology
Symposium (NVMTS) (IEEE, 2016), pp. 1–4
25. S. Majumdar, S.K. Kingra, M. Suri, Programming scheme based optimization of hybrid 4T-2R
OXRAM NVSRAM. Semicond. Sci. Technol. 32(9), 094008 (2017)
26. T. Ohsawa, H. Koike, S. Miura, H. Honjo, K. Tokutome, S. Ikeda, T. Hanyu, H. Ohno, T.
Endoh, 1 Mb 4T-2MTJ nonvolatile STT-RAM for embedded memories using 32b fine-grained
power gating technique with 1.0 ns/200ps wake-up/power-off times, in 2012 Symposium on
VLSI Circuits (VLSIC) (IEEE, 2012), pp. 46–47
27. T. Ohsawa, H. Koike, S. Miura, H. Honjo, K. Kinoshita, S. Ikeda, T. Hanyu, H. Ohno, T. Endoh,
A 1 Mb nonvolatile embedded memory using 4T2MTJ cell with 32 b fine-grained power gating
scheme. IEEE J. Solid-State Circuits 48(6), 1511–1520 (2013)
28. W. Robinett, M. Pickett, J. Borghetti, Q. Xia, G.S. Snider, G. Medeiros-Ribeiro, A memristor-
based nonvolatile latch circuit. Nanotechnology 21(23), 235203 (2010)
29. D. Chabi, W. Zhao, E. Deng, Y. Zhang, N.B. Romdhane, J.-O. Klein, C. Chappert, Ultra low
power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms.
IEEE Trans. Circuits Syst. I: Regul. Pap. 61(6), 1755–1765 (2014)
30. I. Kazi, P. Meinerzhagen, P.-E. Gaillardon, D. Sacchetto, Y. Leblebici, A. Burg, G. De Micheli,
Energy/reliability trade-offs in low-voltage reram-based non-volatile flip-flop design. IEEE
Trans. Circuits Syst. I: Regul. Pap. 61(11), 3155–3164 (2014)
31. A. Lee, C.-P. Lo, C.-C. Lin, W.-H. Chen, K.-H. Hsu, Z. Wang, S. Fang, Z. Yuan, Q. Wei,
Y.-C. King et al., A reram-based nonvolatile flip-flop with self-write-termination scheme for
frequent-off fast-wake-up nonvolatile processors. IEEE J. Solid-State Circuits 52(8), 2194–
2207 (2017)
32. N. Sakimura, T. Sugibayashi, R. Nebashi, N. Kasai, Nonvolatile magnetic flip-flop for standby-
power-free socs. IEEE J. Solid-State Circuits 44(8), 2244–2250 (2009)
33. W. Zhao, E. Belhaire, C. Chappert, Spin-MTJ based non-volatile flip-flop, in 2007 7th IEEE
Conference on Nanotechnology (IEEE NANO) (IEEE, 2007), pp. 399–402
34. S. Yamamoto, Y. Shuto, S. Sugahara, Nonvolatile delay flip-flop using spin-transistor architec-
ture with spin transfer torque mtjs for power-gating systems. Electron. Lett. 47(18), 1027–1029
(2011)
35. K.-W. Kwon, S.H. Choday, Y. Kim, X. Fong, S.P. Park, K. Roy, SHE-NVFF: Spin hall effect-
based nonvolatile flip-flop for power gating architecture. IEEE Electron Device Lett. 35(4),
488–490 (2014)
36. W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, P. Mazoyer, New non-volatile logic based on
spin-MTJ. Phys. Status Solidi (A) 205(6), 1373–1377 (2008)
37. K. Ryu, J. Kim, J. Jung, J.P. Kim, S.H. Kang, S.-O. Jung, A magnetic tunnel junction based
zero standby leakage current retention flip-flop. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 20(11), 2044–2053 (2012)
38. K. Huang, Y. Lian, A low-power low-vdd nonvolatile latch using spin transfer torque MRAM.
IEEE Trans. Nanotechnol. 12(6), 1094–1103 (2013)
39. G. Prenat, K. Jabeur, G. Di Pendina, O. Boulle, G. Gaudin, Beyond STT-MRAM, spin orbit
torque ram SOT-MRAM for high speed and high reliability applications, Spintronics-Based
Computing (Springer, Berlin, 2015), pp. 145–157
40. P. Wang, X. Chen, Y. Chen, H. Li, S. Kang, X. Zhu, W. Wu, A 1.0 V 45nm nonvolatile magnetic
latch design and its robustness analysis, in 2011 IEEE Custom Integrated Circuits Conference
(CICC) (IEEE, 2011), pp. 1–4
41. Y. Jung, J. Kim, K. Ryu, J.P. Kim, S.H. Kang, S.-O. Jung, An MTJ-based non-volatile flip-flop
for high-performance SoC. Int. J. Circuit Theory Appl. 42(4), 394–406 (2014)
42. K. Jabeur, G. Di Pendina, F. Bernard-Granger, G. Prenat, Spin orbit torque non-volatile flip-flop
for high speed and low energy applications. IEEE Electron Device Lett. 35(3), 408–410 (2014)
43. Y. Wang, Y. Liu, S. Li, D. Zhang, B. Zhao, M.-F. Chiang, Y. Yan, B. Sai, H. Yang, A 3us
wake-up time nonvolatile processor based on ferroelectric flip-flops, in 2012 Proceedings of
the ESSCIRC (ESSCIRC) (IEEE, 2012), pp. 149–152
44. S. Masui, W. Yokozeki, M. Oura, T. Ninomiya, K. Mukaida, Y. Takayama, T. Teramoto, Design
and applications of ferroelectric nonvolatile SRAM and flip-flop with unlimited read/program
cycles and stable recall, in Proceedings of the IEEE 2003 Custom Integrated Circuits Confer-
ence, 2003 (IEEE, 2003), pp. 403–406
45. M. Qazi, A. Amerasekera, A.P. Chandrakasan, A 3.4-pJ feram-enabled D flip-flop in 0.13-
um CMOS for nonvolatile processing in digital systems. IEEE J. Solid-State Circuits 49(1),
202–211 (2014)
46. D. Wang, S. George, A. Aziz, S. Datta, V. Narayanan, S.K. Gupta,
Ferroelectric transistor based non-volatile flip-flop, in Proceedings of the 2016 International
Symposium on Low Power Electronics and Design (ACM, 2016), pp. 10–15
47. S. Shigematsu, S. Mutoh, Y. Matsuya, J. Yamada, A 1-v high-speed MTCMOS circuit scheme
for power-down applications, in VLSI Circuits, 1995. Digest of Technical Papers., 1995 Sym-
posium on (IEEE, 1995), pp. 125–126
48. Y. Jung, J. Kim, K. Ryu, S.-O. Jung, J.P. Kim, S.H. Kang, MTJ based non-volatile flip-flop in
deep submicron technology, in 2011 International SoC Design Conference (ISOCC) (IEEE,
2011), pp. 424–427
49. S. Yamamoto, Y. Shuto, S. Sugahara, Nonvolatile flip-flop using pseudo-spin-transistor archi-
tecture and its power-gating applications, in 2012 International Semiconductor Conference
Dresden-Grenoble (ISCDG) (IEEE, 2012), pp. 17–20
50. T. Endoh, T. Ohsawa, H. Koike, T. Hanyu, H. Ohno, Restructuring of memory hierarchy in com-
puting system with spintronics-based technologies, in 2012 Symposium on VLSI Technology
(VLSIT) (IEEE, 2012), pp. 89–90
51. I. Kazi, P. Meinerzhagen, P.-E. Gaillardon, D. Sacchetto, A. Burg, G. De Micheli, A ReRAM-
based non-volatile flip-flop with sub-VT read and CMOS voltage-compatible write, in 2013
IEEE 11th International New Circuits and Systems Conference (NEWCAS) (IEEE, 2013), pp.
1–4
52. W. Kang, Y. Ran, W. Lv, Y. Zhang, W. Zhao, High-speed, low-power, magnetic non-volatile
flip-flop with voltage-controlled, magnetic anisotropy assistance. IEEE Magn. Lett. 7, 1–5
(2016)
53. R. Bishnoi, F. Oboril, M.B. Tahoori, Non-volatile non-shadow flip-flop using spin orbit torque
for efficient normally-off computing, in 2016 21st Asia and South Pacific Design Automation
Conference (ASP-DAC) (IEEE, 2016), pp. 769–774
54. S. Izumi, H. Kawaguchi, M. Yoshimoto, H. Kimura, T. Fuchikami, K. Marumoto, Y. Fujimori,
A ferroelectric-based non-volatile flip-flop for wearable healthcare systems, in 2015 15th Non-
Volatile Memory Technology Symposium (NVMTS) (IEEE, 2015), pp. 1–4
55. K. Ali, F. Li, S.Y.H. Lua, C.-H. Heng, Compact spin transfer torque non-volatile flip flop design
for power-gating architecture, in 2016 IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS) (IEEE, 2016), pp. 119–122
56. H. Li, Z. Jiang, P. Huang, Y. Wu, H.-Y. Chen, B. Gao, X.Y. Liu, J.F. Kang, H.-S.P. Wong,
Variation-aware, reliability-emphasized design and optimization of RRAM using spice model,
in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2015),
pp. 1425–1430
57. S.K. Kingra, S. Majumdar, M. Suri, Stability analysis of hybrid CMOS-RRAM based 4T-2R
NVSRAM, in 2017 15th IEEE International New Circuits and Systems Conference (NEWCAS)
(IEEE, 2017), pp. 125–128
58. E. Seevinck, F.J. List, J. Lohstroh, Static-noise margin analysis of MOS SRAM cells. IEEE J.
Solid-State Circuits 22(5), 748–754 (1987)
59. E. Grossar, M. Stucchi, K. Maex, W. Dehaene, Read stability and write-ability analysis of sram
cells for nanometer technologies. IEEE J. Solid-State Circuits 41(11), 2577–2588 (2006)
60. C.-F. Liao, M.-Y. Hsu, Y.-D. Chih, J. Chang, Y.-C. King, C.J. Lin, Zero static-power 4T SRAM
with self-inhibit resistive switching load by pure CMOS logic process, in 2016 IEEE Interna-
tional Electron Devices Meeting (IEDM) (IEEE, 2016), pp. 16–5
61. T. Song, W. Rim, S. Park, Y. Kim, J. Jung, G. Yang, S. Baek, J. Choi, B. Kwon, Y. Lee et al.,
17.1 a 10nm FinFET 128Mb SRAM with assist adjustment system for power, performance,
and area optimization, in 2016 IEEE International Solid-State Circuits Conference (ISSCC)
(IEEE, 2016), pp. 306–307
62. M.-C. Chen, C.-H. Lin, Y.-F. Hou, Y.-J. Chen, C.-Y. Lin, F.-K. Hsueh, H.-L. Liu, C.-T. Liu,
B.-W. Wang, H.-C. Chen et al., A 10 nm Si-based bulk FinFETs 6T SRAM with multiple fin
heights technology for 25% better static noise margin, in 2013 Symposium on VLSI Technology
(VLSIT) (IEEE, 2013), pp. T218–T219
63. M.-F. Chang, C.-H. Chuang, M.-P. Chen, L.-F. Chen, H. Yamauchi, P.-F. Chiu, S.-S. Sheu,
Endurance-aware circuit designs of nonvolatile logic and nonvolatile SRAM using resistive
memory (memristor) device, in 2012 17th Asia and South Pacific Design Automation Confer-
ence (ASP-DAC) (IEEE, 2012), pp. 329–334
64. S.-S. Sheu, M.-F. Chang, K.-F. Lin, C.-W. Wu, Y.-S. Chen, P.-F. Chiu, C.-C. Kuo, Y.-S. Yang, P.-
C. Chiang, W.-P. Lin et al., A 4Mb embedded SLC resistive-RAM macro with 7.2 ns read-write
random-access time and 160ns mlc-access capability, in 2011 IEEE International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC) (IEEE, 2011), pp. 200–202
65. S.-S. Sheu, P.-C. Chiang, W.-P. Lin, H.-Y. Lee, P.-S. Chen, Y.-S. Chen, T.-Y. Wu, F.T. Chen,
K.-L. Su, M.-J. Kao et al., A 5ns fast write multi-level non-volatile 1 k bits RRAM memory
with advance write scheme, in 2009 Symposium on VLSI Circuits (IEEE, 2009) pp. 82–83
66. M.-F. Chang, P.-F. Chiu, S.-S. Sheu, Circuit design challenges in embedded memory and
resistive RAM (RRAM) for mobile SoC and 3D-IC, in 2011 16th Asia and South Pacific
Design Automation Conference (ASP-DAC) (IEEE, 2011), pp. 197–203
67. J. Tranchant, E. Janod, L. Cario, B. Corraze, E. Souchier, J.-L. Leclercq, P. Cremillieu, P.
Moreau, M.-P. Besland, Electrical characterizations of resistive random access memory devices
based on GaV4S8 thin layers. Thin Solid Films 533, 61–65 (2013)
68. H.-Y. Chen, B. Gao, H. Li, R. Liu, P. Huang, Z. Chen, B. Chen, F. Zhang, L. Zhao, Z. Jiang,
et al., Towards high-speed, write-disturb tolerant 3d vertical RRAM arrays, in 2014 Symposium
on VLSI Technology (VLSI-Technology): Digest of Technical Papers (IEEE, 2014), pp. 1–2
69. S.-Y. Wang, C.-H. Tsai, D.-Y. Lee, C.-Y. Lin, C.-C. Lin, T.-Y. Tseng, Improved resistive switch-
ing properties of Ti/ZrO/Pt memory devices for RRAM application. Microelectron. Eng. 88(7),
1628–1632 (2011)
Chapter 3
Phase Change Memory for Physical
Unclonable Functions

Nafisa Noor and Helena Silva

Abstract Security has become a crucial concern in hardware design due to the growing
need for protection in everyday financial transactions and exchanges of private
information. Physical unclonable functions (PUFs) utilize the inevitable manufacturing
process variations to provide a unique way to verify trusted users. Improvements
in attack methods over the years have recently moved the field of PUFs from traditional
silicon devices toward emerging nonvolatile resistive switching memories. Due
to the intrinsic programming variability in resistive switching memory mechanisms,
together with the high endurance of these devices, unpredictable and reconfigurable
PUF challenge-response pairs can be regenerated a very large number of times. In
the case of phase change memories (PCMs), cell-to-cell and cycle-to-cycle programming
variability results from the random atomic structures created by the rapid
quench from the melt during the reset operation and from the stochastic distribution
and orientation of seed crystals nucleated in the amorphous plug during the set operation.
This programming variability, which comes in addition to the process variations
present in any technology, is an important advantage of PCM (and other resistive
switching memory technologies) for implementations of PUFs and other hardware
security primitives. In this chapter, we review some of the work on conventional
CMOS-based PUFs, the operation principles of PCM devices, and recent reports on
PCM-based PUFs that utilize programming variability.

N. Noor (B) · H. Silva
University of Connecticut, 371 Fairfield Way; U-4157, 06269-4157 Storrs, CT, USA
e-mail: [email protected]
H. Silva
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_3

3.1 Introduction and Conventional Security Solutions

Security is an inherent need for human life. The security requirements in the current
age, however, have broadened beyond the necessity of protecting tangible property.
In the modern world of internet of things (IoT), countless physical devices, vehicles,
home appliances, medical wearables, factories, smart cities, and many other systems
are being intricately connected through sensors and software and are exchanging
data that must be secured [1]. Cisco has predicted an exponential increase in the
number of connected devices, with an estimated 6.58 devices per person on average
worldwide by 2020 [2]. Hence, any breach of security can seriously affect the lives
of large numbers of individuals by leaking health-related, financial, and private
information, and can cause severe economic loss and loss of confidentiality [3].
Therefore, comprehensive measures need to be taken to secure not only cyberspace
but also the myriad connected devices. In this chapter, we describe
conventional and novel methods of securing hardware devices.
In a security system, a securely stored secret key is used along with cryptographic
algorithms. A leak of the secret key means that the security of the system has been
broken [3]. The traditional mechanisms for storing keys in devices include permanently
writing the secret information to a battery-backed static random access memory
(SRAM) array or to a read-only memory (ROM), and using cryptographic operations,
such as digital signatures or encryption [4]. The battery-backed SRAM is expensive
in terms of area and power requirement due to the volatile nature of the memory
operation [4] and suffers from limited reliability due to the possibility of battery
failure. Among the nonvolatile key storage solutions, the most common is ROM-based
storage, in which masks define the permanent keys during the manufacturing
stage, and the keys cannot be erased or modified after manufacturing [3].
This technique requires new masks for each new key and thus prolongs the production
time. The major disadvantage of this scheme, however, is that the secret key is always
available in the nonvolatile ROM, even when the device is powered off,
allowing opportunities for invasive or physical attacks. Recent advances in physical
attack techniques on electronic chips via fault analysis tools have made it easier to
produce fake security chips, which can serve as clones and continue to communicate
in the IoT environment. The most commonly used tools for invasive attacks are
high-resolution imaging with optical microscopy or scanning electron microscopy (SEM)
and destructive tools such as a focused ion beam (FIB) or a laser cutter, which are
used to reverse engineer a chip precisely, layer by layer. In microprobing attacks,
electrical measurements can reveal the secret key stored in the permanent memory
[3].
Other nonvolatile key storage solutions have been proposed using floating gate
technologies (flash memory), but their complex fabrication makes them impractical for
PUF applications, and they are still vulnerable to leakage or manipulation of the secret
key through microprobing attacks [3]. Powered tamper-sensing and tamper-proof
circuits are required to detect or prevent invasive attacks, respectively,
at the cost of additional area and power [4].
The radio frequency identification (RFID) tag is one of the commonly used products
that store secret data permanently. The RFID tag includes an antenna and
communicates with a reader. The reader has its own antenna, interrogates the RFID
tag with a challenge data signal, and provides the energy required for the tag to
operate. The RFID tag transmits back a response signal incorporating the secret
information permanently stored in its memory. The reader then communicates with
the host computer for further processing of the communication and secret information
(Fig. 3.1).

Fig. 3.1 Basic working principle of traditional passive radio frequency identification (RFID)
system [5]

The information is hard programmed into the RFID memory chip during
manufacturing and cannot be erased or modified later. Physical attacks reveal
the secret information and give an adversary the opportunity to produce clone chips.
In addition, the RFID tag is susceptible to information leakage via eavesdropping, in
which an unauthorized reader listens to the communication between the legitimate
tag and reader to steal information or gain access. The attacker can also record one
part of the communication and conduct a replay attack on the receiving device at
a later time [5]. By observing how the power consumption differs when correct and
incorrect passcodes are processed, the attacker can conduct a power analysis
attack, a side-channel attack that retrieves secret information. The attacker
can also pursue a man-in-the-middle attack by blocking or manipulating the signal
communication path, or carry out a denial-of-service attack by injecting noise and
interference into the network in order to take the system down [5]. Hence, secret keys
should ideally be written with unclonable schemes and in sufficiently large numbers
to avoid physical attacks.

3.2 Hardware Security with PUFs

3.2.1 PUF Introduction

Considering the abovementioned threats, intrinsic random physical properties of
circuits or devices have been employed in recent years to create a distinguishable
and unclonable security primitive. The idea is very similar to the presence of distinct
biometrics in individuals (such as fingerprints, voice patterns, irises, or facial features).
The concept was initially introduced as physical one-way functions (POWFs) and
physical random functions, and eventually termed physical unclonable functions
(PUFs). A PUF is queried with a certain input (challenge), and a measurable output
(response) is generated based on the innate unique physical properties of the devices
or circuits that make up the PUF system. A PUF can have one or several challenges and
responses, which are called challenge-response pairs or CRPs (Fig. 3.2a).

Fig. 3.2 a Basic working principle of the physical unclonable function (PUF). b–g Schematic
representations of essential features of PUF: b reproducible, c unique, d unclonable, e one-way,
f unpredictable, g tamper-evident. Schematics redrawn from [7]

The relation between the challenges and responses, or the CRP behavior, should not
be easily expressible as a mathematical function, and true physical randomness can
ensure this. Hence, a PUF is not a mere mathematical function but rather a procedure
with input–output functionality. Moreover, a PUF is not just an abstract concept
but always has to be implemented in a physical entity [6]. PUF CRPs can
be either analog values or digital bit strings. For analog CRPs, several stages of decoding
and quantization are required to generate the digital bit string CRPs. A PUF system
should be easy and economical to implement yet very hard to clone. A PUF should
also be easily measurable within reasonable time, effort, power, or area [6].

3.2.2 PUF Security Metrics

There are several essential features that describe the behavior of a PUF, each described
briefly below (Fig. 3.2b–g):
i. Reliability or reproducibility:
The responses generated by the same PUF queried with the same challenge should
always be very similar across multiple observations (Fig. 3.2b). This feature guarantees
reproducibility of responses, and any dissimilarity among the responses generated
from the same challenge is called noise. Reproducibility is a distinct feature of a
PUF that distinguishes it from a true random number generator (TRNG) [6].
For digital bit string responses, the noise is measured with the Hamming distance
between repeated responses, and the distances are summarized with histograms. A
Gaussian distribution is often used as an approximation, and its mean (μ_intra) or
standard deviation (σ_intra) is calculated to express the amount of noise in terms of
the intra-Hamming distance. For reliable PUF responses, μ_intra is expected to be as
small as possible (ideally ~0%). Minimal noise
is expected even for a large range of variations in environmental factors, such as
external temperature, supply voltage, light exposure, or the effect of aging. Since
the environmental effects are systematic, a differential approach is used to cancel
the disturbances experienced by all PUFs, provided their impact is linear. The responses
generated by two PUFs can be either divided or subtracted to remove the deviations
in responses due to common environmental contributions. This technique is called
compensation [6] and is one of the measures taken for error correction. A PUF may
use multiple error correction techniques, and some of these are vulnerable to leakage
of secret information [4].
ii. Uniqueness:
The responses generated by different PUFs queried with the same challenge should
be distinguishably different (Fig. 3.2c). The spread of the dissimilarity of the responses,
in this case, is expressed by the inter-Hamming distance in terms of
its mean (μ_inter) or standard deviation (σ_inter). For unique PUFs, μ_inter is expected
to be close to 50%, implying a random and equal probability of the occurrence of
either state of each binary response bit [6].
iii. Unclonable:
The PUF system should be impossible for an adversary to clone, even with complete
knowledge of a legitimate PUF instance (Fig. 3.2d). The impossibility of cloning
the system can arise from uncontrolled manufacturing variations and/or other physical
properties manifested at the micro- and nano-scale within the devices [7].
Unclonability is the core feature of a PUF. This property includes two aspects:
mathematical unclonability and physical unclonability. A PUF system is mathematically
unclonable if it cannot be formulated in mathematical models, and it is physically
unclonable if it cannot be reproduced due to manufacturing variations or the inherent
physics of device operation. A PUF system should satisfy both unclonability aspects
to be truly unclonable [6].
iv. One-way:
Since the PUF functionality should not be realizable with simple mathematical
expressions, it should be impossible to invert the PUF behavior mathematically or to
estimate an unknown challenge only by observing a given response (Fig. 3.2e) [7].
v. Unpredictable:
Due to the infeasibility of modeling a PUF system, knowledge of a given challenge
should not reveal the expected response either (Fig. 3.2f) [7]. A PUF system fails
to satisfy this requirement if an adversary with access to the full PUF can predict
the upcoming responses for given challenges based on the knowledge gained
from observing a set of previous CRPs. In this case, the adversary would
have succeeded in modeling the PUF system, and this mathematical clone would have
broken the unpredictability of the PUF [6].
vi. Tamper-evident:
A physical attack on the PUF system should permanently change its functionality
or leave indelible evidence in the device so that further measurements on the device
clearly indicate tampering (Fig. 3.2g) [7].
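
The reliability and uniqueness metrics above can be illustrated with a minimal Python sketch. It assumes a hypothetical population of PUFs with ideal random reference responses and an independent 3% bit-flip noise model; neither assumption comes from the chapter, and the intra-distance is taken here over all pairs of repeated reads rather than against a single enrolled reference.

```python
import random
from itertools import combinations

def hamming_fraction(a, b):
    """Fractional Hamming distance between two equal-length bit lists."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise_distance(responses):
    """Mean fractional Hamming distance over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)

def noisy_copy(bits, flip_prob, rng):
    """Re-read a response; each bit flips with probability flip_prob (assumed noise model)."""
    return [b ^ (rng.random() < flip_prob) for b in bits]

rng = random.Random(0)
n_bits, n_pufs, n_reads = 256, 20, 10

# Hypothetical population: one ideal random reference response per PUF instance.
references = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_pufs)]

# Intra-distance: repeated noisy reads of the same PUF with the same challenge.
reads = [noisy_copy(references[0], 0.03, rng) for _ in range(n_reads)]
mu_intra = mean_pairwise_distance(reads)

# Inter-distance: responses of different PUFs to the same challenge.
mu_inter = mean_pairwise_distance(references)

print(f"mu_intra = {mu_intra:.3f} (ideal ~0), mu_inter = {mu_inter:.3f} (ideal ~0.5)")
```

In this toy population μ_intra stays close to zero while μ_inter stays close to 50%, which is the pattern the two metrics are meant to capture.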

3.2.3 PUF Classification

Based on the construction and operation principle, PUFs can be divided into the two
broad categories of non-electronic and electronic PUFs (Fig. 3.3).

3.2.3.1 Non-electronic PUFs

The non-electronic PUFs rely on the physical variability of a non-electronic
stochastic system, such as an optical, magnetic, or acoustic one. Digital techniques are
eventually used to process the raw responses generated by a non-electronic PUF
[6]. Among the non-electronic PUFs, the optical PUF is the most common one, and
optical PUF-like systems were already used in the late 1980s for unique identification.
Unique optical reflection patterns from sprayed layers with randomly spaced
light-reflecting particles were used for unclonable identification of weapons [8]. In the
early 2000s, the unique interference pattern created by randomly
spaced light-scattering particles was employed for an optical POWF, which also
introduced the concept of PUF for the first time [9]. The optical token used in this
system was made with a transparent epoxy plate (10 × 10 × 2.5 mm) containing
randomly placed refractive glass spheres (~500–800 µm diameter spheres with an
average spacing of ~100 µm). The token was irradiated with a laser at a certain
angle, and the resulting speckle pattern generated at a screen was recorded by a CCD
camera (Fig. 3.4a). The challenge for this PUF system was the angle at which the
laser was shone, and the raw response was the random speckle pattern. Due to the
random spatial arrangement of the glass spheres inside the token, each laser angle
resulted in a unique light-scattering configuration, which in turn produced random
dark and bright spots on the screen. These raw speckle pattern responses were then
post-processed with a Gabor hash function to create a digital bit string output [9].
Despite detailed experimental validations and high security against modeling and
physical cloning attacks, the optical PUF system was later deemed impractical
due to the cumbersome positioning of the optical system and the difficulty of
miniaturizing the design into a compact chip with precise readout mechanisms [11].

Fig. 3.3 Proposed PUFs classified in terms of construction and working principle [6]:
  Non-electronic PUFs: optical PUF, paper PUF, CD PUF, RF PUF, magnetic PUF, acoustic PUF
  Electronic PUFs
    Analog: threshold voltage PUF, power distribution PUF, coating PUF, LC PUF
    Digital, delay-based: arbiter PUF, ring oscillator PUF
    Digital, memory-based, CMOS-based memory PUFs: SRAM PUF, latch PUF, butterfly PUF, flip-flop PUF
    Digital, memory-based, emerging NVM-based PUFs: PCM PUF, RRAM PUF, STT-MRAM PUF

Designs for the integration of such systems into a chip have later been proposed with
the sequential arrangement of the light source array, the same disordered optical
medium, and the sensor array [10] (Fig. 3.4b, c).

Fig. 3.4 a Optical physical one-way function (POWF) [9] and b, c later design proposals for
integrated optical PUF. Schematics redrawn from [6] and [10]

Besides optical scattering from randomly distributed particles or shapes, other
sources of randomness that have been proposed for non-electronic PUFs include
(Fig. 3.3) the following:
i. the unique and random fiber structure of paper for forgery prevention (scanned
for measurement) [12],
ii. the random measured lengths of lands and pits on a regular compact disk (CD)
(measured by the electrical signal generated by the photodiode inside the CD reader)
[13],
iii. the random positioning of thin copper wire within a silicone rubber sealant (the
near-field scattering of electromagnetic waves in the 5–6 GHz band was measured
with an RF antenna array) [14],
iv. the unique particle pattern in the magnetic media of a swipe card [15],
v. the characteristic frequency spectrum of an acoustic delay line that converts an
alternating electrical signal into mechanical vibration and back [16].

3.2.3.2 Electronic PUFs

Analog Electronic PUFs

The analog electronic PUFs are based on analog measurements of variability-prone
electronic quantities originating from manufacturing variations, and quantization
processes are applied to the raw analog responses when digital bit string outputs are
desired [6]. Some examples of such quantities from the literature include the following:
i. the threshold voltage variation measured in identically designed transistors in
an array [17],
ii. the resistance variation in the power grid of a chip [18],
iii. the capacitance variation in comb-shaped sensors in the top metal layer of an
integrated circuit, where a passive dielectric spray containing random dielectric
elements is explicitly introduced (this PUF is also called the coating PUF) [19],
iv. the resonant frequency variation in identically designed LC circuits built from a
glass plate sandwiched between two metal plates, along with a serially chained
metal coil [20].

Digital Electronic PUFs

Digital PUFs output digital bit(s) as responses. There are two major categories: the
digital delay-based PUFs, which include arbiter PUFs and ring oscillator PUFs,
and the memory-based PUFs, which are based on conventional CMOS memories or
emerging nonvolatile memories (NVM) (Fig. 3.3).
The arbiter PUF relies on the digital race condition between two symmetrically
designed paths constructed with switch blocks. The switch blocks can be made of
a pair of multiplexers and buffers with two inputs and two outputs in total. Based
on a parameter input bit (0 or 1), the input and the output pairs are connected in
either a straight or a switched fashion (Fig. 3.5a). The challenge for the arbiter PUF
is the sequence of parameter bits that are fed to the serially connected switch blocks.
Due to manufacturing variations, there is always a slight difference between the two
identically designed paths and thus one path becomes a little faster in propagating
the signal. The random small difference between the two delays is received by an
arbiter circuit, which decides which path wins the race by outputting a 0 or a 1 as the
response. The arbiter circuit is made with a latch or a flip-flop [21, 22] (Fig. 3.5a).
The differential nature of the response from the arbiter output cancels out the linear
environmental factors, such as temperature, power supply voltage, or aging effects,
that both delay lines equally experience [6]. If the delay difference between two
paths is too small, the arbiter circuit output will no longer depend on the race but
rather will be determined by random noise, resulting in metastability of the arbiter
and noisy PUF responses [6].
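
The race between the two paths can be sketched with a toy additive-delay simulation in Python. The class name `ArbiterPUF`, the four Gaussian path delays per switch block, and the convention that the arbiter outputs 1 when the top rail wins are all assumptions made for illustration; they are not taken from [21, 22].

```python
import random

class ArbiterPUF:
    """Toy additive-delay model of an n-stage arbiter PUF (all delay values are hypothetical)."""

    def __init__(self, n_stages, seed):
        rng = random.Random(seed)
        # Per switch block: delays of its four internal paths
        # (top->top, bottom->bottom when straight; top->bottom, bottom->top when switched).
        self.stages = [[rng.gauss(10.0, 0.1) for _ in range(4)] for _ in range(n_stages)]

    def delay_difference(self, challenge):
        top, bottom = 0.0, 0.0  # accumulated delay of the signal currently on each rail
        for bit, (tt, bb, tb, bt) in zip(challenge, self.stages):
            if bit == 0:   # straight connection
                top, bottom = top + tt, bottom + bb
            else:          # switched connection: the two signals swap rails
                top, bottom = bottom + bt, top + tb
        return top - bottom

    def response(self, challenge, noise_sigma=0.0, rng=random):
        # The arbiter outputs 1 if the signal on the top rail arrives first.
        # noise_sigma models arbiter noise, which dominates when the race is nearly tied.
        diff = self.delay_difference(challenge) + rng.gauss(0.0, noise_sigma)
        return 1 if diff < 0 else 0

n = 64
puf_a, puf_b = ArbiterPUF(n, seed=1), ArbiterPUF(n, seed=2)
rng = random.Random(0)
challenges = [[rng.randint(0, 1) for _ in range(n)] for _ in range(200)]
resp_a = [puf_a.response(c) for c in challenges]
resp_b = [puf_b.response(c) for c in challenges]
differing = sum(x != y for x, y in zip(resp_a, resp_b)) / len(challenges)
print(f"size of the challenge space: 2**{n} = {2**n}")
print(f"fraction of responses that differ between the two instances: {differing:.2f}")
```

Because both rails pass through the same number of blocks, any delay shift common to the two paths cancels in `top - bottom`, which mirrors the differential behavior described above.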
By concatenating numerous switch blocks together, a large bit string is created
as the challenge, and hence an exponentially large number of CRPs (2^n CRPs
for n switch blocks) can be generated, despite the one-bit response [21, 22].
Due to the large number of CRPs, these PUFs are also categorized as strong PUFs
and are used for authentication applications. After being used, each CRP is marked
as "used" in the server database so that it cannot be reused, thus avoiding replay attacks
[4].
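
The bookkeeping of used CRPs can be sketched as follows in Python. The `Verifier` class and the `toy_puf` function are hypothetical: a keyed hash stands in for the physical PUF only so that the example runs, and exact response matching is assumed, whereas a real system would tolerate a small Hamming distance.

```python
import hashlib
import secrets

def toy_puf(device_secret: bytes, challenge: bytes) -> int:
    """Stand-in for a physical strong PUF returning one response bit per challenge.
    A keyed hash is used only so the sketch runs; a real PUF stores no secret."""
    return hashlib.sha256(device_secret + challenge).digest()[0] & 1

class Verifier:
    """Server-side CRP database in which every CRP is spent at most once."""

    def __init__(self):
        self.crps = {}     # challenge -> expected response, recorded at enrollment
        self.used = set()  # challenges that have already been spent

    def enroll(self, puf, n_crps):
        for _ in range(n_crps):
            c = secrets.token_bytes(8)
            self.crps[c] = puf(c)

    def authenticate(self, puf, n_bits=32):
        fresh = [c for c in self.crps if c not in self.used][:n_bits]
        if len(fresh) < n_bits:
            raise RuntimeError("not enough unused CRPs; re-enrollment required")
        self.used.update(fresh)  # mark as used whether or not the check passes
        return all(puf(c) == self.crps[c] for c in fresh)

genuine = lambda c: toy_puf(b"device-A", c)
impostor = lambda c: toy_puf(b"device-B", c)

server = Verifier()
server.enroll(genuine, 128)
ok_genuine = server.authenticate(genuine)
ok_impostor = server.authenticate(impostor)
print(ok_genuine, ok_impostor, len(server.used))
```

Since spent challenges are never issued again, a recorded exchange is of no use to an attacker in a later session.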

Fig. 3.5 Delay-based CMOS PUFs: a arbiter PUF and b, c ring oscillator PUF with division and
comparator compensation techniques. Schematics redrawn from [6]

Due to the linear additive behavior of digital delays in the basic arbiter PUF, it
is possible to model the entire arbiter PUF system mathematically using machine
learning techniques, and accurate predictions can be made about unused CRPs after
observing a certain number of CRPs. This is called a model-building attack, and it
breaks the security of this PUF [23]. Subsequent works on arbiter PUFs were intended
to make model-building attacks difficult. XOR-arbiter PUFs [24] and feedforward
arbiter PUFs [25] are two examples of such improvements based on the introduction
of nonlinearity to the delay lines. For the feedforward arbiter PUF, several input
challenge bits were received from the outcomes of some randomly placed intermediate
arbiter circuits. However, these improved arbiter PUFs were shown to still be
vulnerable to more advanced model-building attack techniques [25, 26].
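
A model-building attack on the basic arbiter PUF can be sketched in Python under the commonly used linear additive delay model, in which the delay difference is a weighted sum of parity features of the challenge. Logistic regression is used here as one simple machine learning technique, not necessarily the one used in [23]; the secret weights, sample counts, and learning rate are hypothetical.

```python
import math
import random

def parity_features(challenge):
    """Map challenge bits to the +/-1 parity vector of the linear delay model."""
    phi, prod = [], 1
    for bit in reversed(challenge):
        prod *= 1 - 2 * bit
        phi.append(prod)
    phi.reverse()
    return phi + [1]  # constant term

def true_response(weights, challenge):
    """Response of the simulated PUF: sign of the weighted feature sum."""
    delta = sum(w * p for w, p in zip(weights, parity_features(challenge)))
    return 1 if delta > 0 else 0

def train_logistic(samples, n_features, epochs=20, lr=0.05):
    """Plain stochastic-gradient logistic regression; returns the learned weights."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for phi, r in samples:
            z = sum(wi * pi for wi, pi in zip(w, phi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(n_features):
                w[i] += lr * (r - p) * phi[i]
    return w

rng = random.Random(42)
n_stages = 32
secret_w = [rng.gauss(0.0, 1.0) for _ in range(n_stages + 1)]  # hidden delay parameters

def random_crp():
    c = [rng.randint(0, 1) for _ in range(n_stages)]
    return parity_features(c), true_response(secret_w, c)

observed = [random_crp() for _ in range(2000)]  # CRPs seen by the attacker
unseen = [random_crp() for _ in range(1000)]    # CRPs the attacker tries to predict
model = train_logistic(observed, n_stages + 1)

hits = sum((sum(w * p for w, p in zip(model, phi)) > 0) == bool(r) for phi, r in unseen)
accuracy = hits / len(unseen)
print(f"prediction accuracy on unseen CRPs: {accuracy:.3f}")
```

The attacker never sees `secret_w`, yet a few thousand observed CRPs are enough to predict most unseen responses out of a challenge space of 2^32.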
Ring oscillator (RO) PUFs also rely on delay deviations [23]. In an RO circuit,
the output of a digital delay line is fed back to the input to create an asynchronous
oscillating loop. Due to manufacturing variations, the delay varies randomly among
identically designed circuits, which in turn determines the resulting random
frequency of the oscillation. The frequency is measured with an edge detector and
a digital counter circuit connected at the output of the RO. A parameterizable delay
setting is used as the challenge for this PUF, and the measured frequency value at
the counter output is used as the analog response for the basic RO PUF. However, as
the resulting frequency greatly depends on temperature and power supply voltage,
the PUF responses become noisy when environmental factors fluctuate.
Therefore, compensation techniques are implemented by either dividing or
subtracting the output frequency values from a pair of ring oscillators [23]
(Fig. 3.5b, c). The delay circuits used for RO PUFs are of the same type as those
used in arbiter PUF circuits, and hence similar model-building attacks are possible
[21, 22]. Moreover, an unexpectedly high correlation exists between the responses
generated from (1) the same challenge applied to different FPGAs and (2)
different challenges applied to the same FPGA [6]. In later works, only one out
of eight pairs of ROs is selected to improve the uniqueness and reproducibility
of the RO PUF. This technique is termed 1-out-of-8 masking [4].
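
The effect of compensation can be illustrated with a small Python sketch. The frequency model (a linear temperature coefficient plus small jitter) and all numerical values are hypothetical and chosen only to make the trend visible.

```python
import random

def ro_frequency(base_mhz, temp_c, rng, temp_coeff=-0.002, jitter=0.0005):
    """Hypothetical RO model: frequency scales linearly with temperature, plus small jitter."""
    scale = 1.0 + temp_coeff * (temp_c - 25.0)
    return base_mhz * scale * (1.0 + rng.gauss(0.0, jitter))

rng = random.Random(3)
# Two identically designed ROs whose nominal frequencies differ only by process variation.
base_a, base_b = 201.3, 199.6

raws, ratios, bits = [], [], []
for temp in (-20.0, 25.0, 85.0):
    fa = ro_frequency(base_a, temp, rng)
    fb = ro_frequency(base_b, temp, rng)
    raws.append(fa)                     # uncompensated response drifts with temperature
    ratios.append(fa / fb)              # division compensation
    bits.append(1 if fa > fb else 0)    # comparator compensation (one-bit response)
    print(f"T = {temp:6.1f} C  raw = {fa:7.2f} MHz  ratio = {fa / fb:.4f}  bit = {bits[-1]}")
```

The raw frequency changes by tens of MHz over the temperature range, while the ratio and the comparator bit stay essentially constant because both oscillators experience the same environmental shift.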
PUFs based on conventional CMOS memory rely on the settling state of a destabilized
digital memory circuit (Fig. 3.6). A digital memory cell has two or more
logical states, and, in normal operation, it can be programmed to one of these stable
states and be used for information storage. However, if the memory cell is brought
to an unstable state, it may start oscillating between the possible stable states and,
after a certain time, converge to a preferred state, depending on the uncontrolled
physical mismatch introduced during manufacturing [6]. This concept has been used to
implement PUFs with SRAM, latch, and flip-flop circuits.
The SRAM cell is made of two cross-coupled inverters consisting of four metal
oxide semiconductor field effect transistors (MOSFETs) along with two additional
access MOSFETs. Due to the inevitable manufacturing variations, the two halves cannot
be made exactly identical, and hence each SRAM cell has a slight inclination toward
one of the two logical states (0 or 1) at power-up. For the SRAM PUF,
powering up is the challenge, and the one-bit settling state is the response, resulting
in a single CRP per cell [27] (Fig. 3.6a). Due to the limited number of CRPs and the
linear relation between the number of CRPs and the number of components, these PUFs
are also categorized as weak PUFs and are used for key generation applications [4].
Very similar concepts have also been applied as follows:
i. latch PUF, where two cross-coupled NOR gates are brought to an unstable state
by a reset signal (as the challenge) and the settling state is observed (as the
response) [28] (Fig. 3.6b),
ii. butterfly PUF, where two cross-coupled latches are brought to an unstable state
by a clear/preset function (as the challenge) and the settling state is observed
(as the response) [29] (Fig. 3.6c),
iii. flip-flop PUF, where power-up condition (challenge) results in a settling state
(response) [30].
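
The settling behavior shared by these memory-based PUFs can be sketched in Python with a toy model in which every cell has a fixed mismatch value and each power-up adds a small random noise term. The Gaussian mismatch and noise magnitudes, and the use of a majority vote to form a reference response, are assumptions for illustration only.

```python
import random

def make_sram_array(n_cells, seed, mismatch_sigma=1.0):
    """Each cell gets a fixed, hypothetical mismatch value set at 'manufacturing'."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, mismatch_sigma) for _ in range(n_cells)]

def power_up(mismatch, rng, noise_sigma=0.1):
    """One power-up: a cell settles to 1 if its mismatch plus noise is positive."""
    return [1 if m + rng.gauss(0.0, noise_sigma) > 0 else 0 for m in mismatch]

def majority_vote(reads):
    """Per-cell majority over several power-ups, used here as a reference response."""
    return [1 if 2 * sum(bits) > len(bits) else 0 for bits in zip(*reads)]

def fraction_differing(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
chip_a = make_sram_array(512, seed=10)
chip_b = make_sram_array(512, seed=11)

ref_a = majority_vote([power_up(chip_a, rng) for _ in range(9)])
ref_b = majority_vote([power_up(chip_b, rng) for _ in range(9)])
new_read_a = power_up(chip_a, rng)

intra = fraction_differing(new_read_a, ref_a)  # same chip, new power-up
inter = fraction_differing(ref_a, ref_b)       # two different chips
print(f"intra-chip distance = {intra:.3f}, inter-chip distance = {inter:.3f}")
```

Only cells whose mismatch is comparable to the noise flip between power-ups, so the intra-chip distance stays small while two chips disagree on roughly half of their cells.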
Fig. 3.6 Memory-based CMOS PUFs: a SRAM PUF, b latch PUF, and c butterfly PUF. Schematics
redrawn from [6]

3.2.4 Advantages and Disadvantages of Different PUFs

All the abovementioned PUF technologies, along with the permanent key storage
schemes, can generate unique identifiers or keys. The permanent key storage schemes,
however, require hard programming during manufacturing to generate the keys and
are vulnerable to physical cloning. For PUFs, in contrast, the uncontrollable process
variations prevent manufacturing an exact physical copy [6].
The delay-based PUFs (basic arbiter, feedforward arbiter, and ring oscillator
PUFs) are prone to model-building attacks and thus fail the unpredictability
requirement of a PUF. Even though the number of CRPs is exponentially large for the
arbiter PUFs, the security is in peril once even a relatively small number of CRPs
has been observed by the attacker. On the other hand, the CMOS memory-based PUFs
(SRAM, latch, and butterfly PUFs) can be read exhaustively, and the entire CRP
database will then be known to the attacker. For these memory-based PUFs, the number
of CRPs is linear in the number of cells in the array, and thus it is easier for an
attacker to acquire the entire CRP database [31]. The same problem also exists for the
coating PUF and the ring oscillator PUF with comparator compensation and 1-out-of-8
masking. Hence, mathematical cloning of all these CMOS-based PUFs is possible
even though physical cloning of such technologies is impossible. Once a PUF is
manufactured, the physical mismatch that determines the preferred output state for a
given challenge stays unchanged for the lifetime of the PUF. Hence, there is no option
to refresh the CRPs for these PUF technologies [6].
To increase security against mathematical cloning, controlled PUFs (CPUFs)
have been proposed, in which the PUF is complemented with cryptographic algorithms.
In a CPUF, the PUF is accessed only by the algorithm. A cryptographic hash
function is used to generate randomly picked challenges so that model-building
attacks can be avoided (although this method cannot thwart the model-building
attack on the arbiter PUF). The random challenges are used to interrogate the PUF, and
the generated responses are then fed to an error correction code (ECC) to improve
the reliability or minimize noise. The output of the ECC is then fed into another
cryptographic hash function, which breaks the link between the responses and the
actual physical details of the PUF measurements [6, 23] (Fig. 3.7a).
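
The data flow of a controlled PUF can be sketched in Python as a chain of hash, PUF, error correction, and hash. Everything specific here is an assumption: the `NoisyToyPUF` class replaces the physical PUF with fixed per-device bits plus random bit flips, a majority vote over repeated reads stands in for the ECC, and SHA-256 is used for both hash functions.

```python
import hashlib
import random

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bits_of(data: bytes, n):
    """First n bits of a byte string as a list of 0/1 integers."""
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(n)]

class NoisyToyPUF:
    """Stand-in for a physical PUF: fixed per-device bits plus random read noise."""

    def __init__(self, device_id: bytes, flip_prob=0.02, seed=0):
        self.device_id, self.flip_prob = device_id, flip_prob
        self.rng = random.Random(seed)

    def read(self, challenge: bytes, n_bits=64):
        ideal = bits_of(sha(self.device_id + challenge), n_bits)
        return [b ^ (self.rng.random() < self.flip_prob) for b in ideal]

def controlled_puf(puf, user_input: bytes, n_reads=15):
    challenge = sha(b"in:" + user_input)                 # input hash picks the challenge
    reads = [puf.read(challenge) for _ in range(n_reads)]
    corrected = [int(2 * sum(col) > n_reads) for col in zip(*reads)]  # majority vote
    return sha(b"out:" + bytes(corrected)).hex()         # output hash hides the raw bits

chip_1 = NoisyToyPUF(b"chip-1", seed=1)
chip_2 = NoisyToyPUF(b"chip-2", seed=2)
out_first = controlled_puf(chip_1, b"hello")
out_again = controlled_puf(chip_1, b"hello")
out_other_input = controlled_puf(chip_1, b"world")
out_other_chip = controlled_puf(chip_2, b"hello")
print(out_first == out_again, out_first != out_other_input, out_first != out_other_chip)
```

The caller only ever sees hashed outputs, so neither the challenge actually applied to the PUF nor its raw response bits are exposed for model building.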
Another way to increase the security of the system is provided by
reconfigurable PUFs (RPUFs). In an RPUF, part or all of the CRP set can be
refreshed irreversibly, and thereby a completely new PUF is created after every
refresh. The RPUFs are categorized into two types: logically and physically
reconfigurable PUFs (L-RPUFs and P-RPUFs). In an L-RPUF, the responses are interfaced
with a multiplexer, control logic, or control query algorithm for the reconfiguration
purpose (Fig. 3.7b) [32]. In a P-RPUF, in contrast, the responses are intrinsically
altered by the physical mechanism involved in refreshing the material properties,
which makes it not only more area efficient than an L-RPUF but also more secure
against tampering because of the physical origin of stochasticity (Fig. 3.7c) [33, 34].

Fig. 3.7 Working principles of a controlled PUF [6], b logically reconfigurable PUF [33], and
c physically reconfigurable PUF [33]. C, R, R' stand for challenge, response, and reconfigured
response. C_int and R_int refer to intermediate challenge and responses

3.3 Nanodevice-Based PUFs

CMOS-based PUFs received significant attention and were the focus of rigorous
research efforts for many years. However, the unaddressed challenges of overcoming
mathematical clonability have recently shifted the focus in the PUF field toward novel
nanotechnologies and nanomaterials. Moreover, CMOS technology is approaching
its scaling limits and new technologies are emerging to continue to deliver perfor-
mance improvements with smaller devices [35].
For storage applications, various kinds of resistive switching memory technolo-
gies have been demonstrated with promising speed, endurance, retention time, scal-
ability, and energy efficiency. The most progress has been made in the fields of phase
change memory (PCM), resistive random access memory (RRAM), and spin-transfer
torque magnetic random access memory (STT-MRAM) technologies. These nanode-
vices offer easy fabrication with simple cell structures (Fig. 3.8) [36]. All these
technologies incorporate compact two-terminal devices relying on resistive switch-
ing. These memory devices can be reversibly programmed to various resistance
levels by suitable electrical pulses. The programmed states are easily distinguishable
as distinct states and are stable under normal operating conditions (such as room
temperature and supply voltage) assuring long data retention time.

Fig. 3.8 Schematics of typical cell structures for a phase change memory (PCM), b resistive
random access memory (RRAM), and c spin-transfer torque magnetic random access memory
(STT-MRAM) cells (not drawn to scale)

These novel memories can potentially produce lightweight, robust, secure, and
reconfigurable PUFs and other security primitives to meet next-generation security
challenges. An intrinsic property of these nanodevices, programming variability, which
is typically a disadvantage for memory implementations, is an important advantage for
PUF applications (in addition to the process variation present in any technology). The
programming variability is observed on the same cell for different cycles of operation
(cycle-to-cycle variability) as well as on different cells for the same programming
conditions (cell-to-cell variability). As the randomness originates from stochastic
rearrangement at the atomic scale, it is impossible to formulate or predict the
variability pattern. The cycle-to-cycle programming variability provides
reconfigurability for a PUF, since new CRPs are obtained after each reprogramming.
The cell-to-cell programming variability in nanodevices, on the other hand,
is the result of the innate programming variability as well as the process-related
variations. The resistance variation observed at either of the programmed states is
also a source of variability for PUFs utilizing these nanodevices [37, 38]. For a
resistance-variability-based PUF, a challenge does not require a programming
operation, and a low-voltage read operation is enough. In contrast,
programming-variability-based PUFs require a programming pulse for each challenge
inquiry, which is expensive in terms of power. However, reconfigurability is the
attractive feature of programming-variability-based PUFs: the uncertainty of the same
PUF is refreshed with each reprogramming to thwart physical
attacks [33].

3.3.1 Phase Change Memory

PCM was first introduced by Ovshinsky in the late 1960s with the Ovonic threshold
switch (OTS) phenomenon [39], which also showed promise for repeated memory
operation [40]. However, the low programming speed and high programming energy of the
prototype devices [41] dampened interest in PCM as an electronic memory and redirected
subsequent research toward optical data storage during the 1990s and 2000s [42].
In the early 2000s, advances in PCM materials with improved scalability, speed, and
resistivity contrast led to renewed interest in PCM. PCM was then envisioned as a
“universal memory” that could potentially replace both DRAM and NAND flash [43].
However, the high reset current in PCM kept its scaling from keeping pace with NAND
flash, and the writing speed and endurance could not reach DRAM standards either.
Considering the progress and the remaining limitations, PCM is now regarded as a storage
class memory (SCM), a complementary technology that bridges the latency gap between NAND
flash and DRAM (Fig. 3.9) [43, 44], together with RRAM and MRAM. PCM can serve either as
storage-type SCM, for which high density is the main requirement, or as memory-type SCM,
for which high endurance (≥10^12 cycles) and high reset and set speeds (<50 ns) are the
critical requirements. Multi-level cell (MLC) storage (e.g., 2 bits per cell with four
distinct resistance states) and further progress in scaling, beyond the 4–6 F^2 cell with
sub-10 nm feature size, can lead to further improvements of PCM [43, 45].
PCM stores information in the phase of a chalcogenide material that can be
reversibly switched between two (or more) stable states with distinct resistance levels.
The high resistance state (HRS) in PCM is the amorphous or reset state, and the low
resistance state (LRS) is the crystalline or set state (Fig. 3.10). The most used and
studied phase change material has been the Ge2Sb2Te5 compound (GST-225) due to its
crystallization speed, stability, and resistivity contrast between the amorphous and
crystalline phases. In a typical PCM cell, the GST material is sandwiched between two
contacts. The mushroom cell has been the standard cell, in which the phase change
material is placed above a nanoscale bottom contact, called the heater, since it defines
the highest current density region for Joule heating and the minimum amorphous plug
required to block conduction in the high resistance state [47, 48].

Fig. 3.9 Comparison of the latency of different memory and storage technologies (SRAM 1X,
DRAM 10X, PCM 10^2 X, NAND solid-state drive 10^5 X, hard disk drive 10^7 X). PCM, along with
RRAM and MRAM, is competing to bridge the latency gap between the conventional memory and
storage devices [46]

Fig. 3.10 Schematics of the reset (amorphization), the set (crystallization), and the low-voltage
read operations in PCM [47]

For amorphization (reset), an amorphous plug is created in the phase change
material just above the heater (also called the active region) by rapid cooling after
melting at or above ~900 K (melt-quench), using a high-amplitude, short electric pulse
(typically <50 ns) that is terminated abruptly [43]. The rapid quench results in random
atomic rearrangements upon solidification and leaves the material in its amorphous
phase (Fig. 3.11b). For crystallization (set), an electric pulse with moderate amplitude
and longer duration (100 ns to 10 µs [43]) brings the active region above the
crystallization temperature (~500 K), yet below the melting temperature, for a
sufficient period to allow recrystallization (Fig. 3.11a).
The high reset current requirement of the mushroom cell necessitates a large selector
device, blocking the path toward further scaling [43]. Later cell designs showed
improved power consumption and reliability, and the currently preferred cells are
confined cells, which improve thermal confinement and minimize the active region
that must be amorphized and crystallized [49]. PCM operation has been demonstrated
for active regions as small as ~1 nm using a carbon nanotube as the critical
contact [50], demonstrating the high scalability potential of PCM compared to
traditional memory technologies. Alternative compositions of GST or other chalcogenide
materials, as well as alternative deposition methods such as atomic layer
deposition (ALD) and chemical or physical vapor deposition (CVD or PVD), have
also shown improved endurance [51], low reset currents [52], fast
crystallization, and good scalability.

Fig. 3.11 (a) Crystalline or set state and (b) amorphized or reset state in a mushroom cell

The large memory window in PCM (resistance contrast between the amorphous
and crystalline phases) enables MLC storage, by which programming to multiple,
distinct resistance levels between the full reset and set states allows for storage of
more than one bit per cell for a significant increase in memory density (Fig. 3.12a)
[53]. The intermediate states are achieved by partial reset or partial set operations
either with a single pulse, with amplitude in between that of the full reset and the
full set pulses, or gradually by applying repeated smaller amplitude reset or set
pulses [54–56]. By applying appropriate voltages to the perpendicularly arranged
word line and bit line metals, the PCM cell at the cross-point is addressed in the
crossbar architecture (Fig. 3.12b) [46, 47]. Besides device downsizing, MLC, and
crossbar architecture, high density is also achieved with the 3D stackability of PCM
(Fig. 3.12b).
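
As a rough illustration of MLC readout, the sketch below maps a read resistance onto one of 2^k levels spaced equally in log10(R) across the memory window. The window edges (1e3 Ω and 1e6 Ω), the equal log spacing, and the names R_LRS_OHM, R_HRS_OHM, and mlc_level are assumptions made here for illustration; they are not taken from the cited works.

```python
import math

# Hypothetical window edges, chosen from the orders of magnitude in Fig. 3.12a.
R_LRS_OHM = 1e3   # assumed full-set (crystalline) resistance
R_HRS_OHM = 1e6   # assumed full-reset (amorphous) resistance


def mlc_level(resistance_ohm: float, bits_per_cell: int = 2) -> int:
    """Map a read resistance to one of 2**bits_per_cell levels.

    Levels are equally spaced in log10(R) between the assumed LRS and HRS
    values; level 0 is the lowest-resistance band.
    """
    levels = 2 ** bits_per_cell
    lo, hi = math.log10(R_LRS_OHM), math.log10(R_HRS_OHM)
    # Normalized position inside the memory window, clamped to [0, 1].
    x = (math.log10(resistance_ohm) - lo) / (hi - lo)
    x = min(max(x, 0.0), 1.0)
    # The top edge of the window belongs to the highest level.
    return min(int(x * levels), levels - 1)
```

With two bits per cell, a cell read at 1e4 Ω falls in level 1 and a cell read at 1e5 Ω in level 2 under these assumed edges. A real design would place the thresholds from measured resistance distributions rather than from equal log spacing.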

3.3.2 Useful Properties of PCM for PUFs

Despite the tremendous progress made in PCM device research over the past few
years, PCM still exhibits cell-to-cell and cycle-to-cycle resistance variability. The
variability can occur during both the reset and set operations, even if the pulse
parameters are chosen for successful switching. This phenomenon is known as programming
or switching variability, and it can be more severe in the weak programming regimes,
i.e., in partial reset and partial set operations. An attempt to program a cell toward
the partial reset regime from the crystalline state may not always be successful: a
moderate pulse chosen for this operation may leave the cell unchanged, at the initial
crystalline state, or instead program it fully into the high resistance amorphous state
[57]. These unpredictable partial programming operations in PCM limit memory
performance but enable valuable implementations for PUF applications.

Fig. 3.12 (a) The large memory window in PCM cells, from ~10^2–10^3 Ω in the crystalline or low
resistance state (LRS) to ~10^6–10^7 Ω in the amorphous or high resistance state (HRS), enables
multi-level cell (MLC) operation with highly dense mixed (intermediate) states. (b) Three-dimensionally
integrable crossbar architecture (bit line, selector, memory cell, word line) for high-density PCM [47]

The cycle-to-cycle stochastic nature in the reset operation originates from the
random atomic rearrangement that takes place during the rapid melt-quench process
[58]. On the other hand, the cycle-to-cycle stochastic nature in the set operation stems
from the random spatial arrangement of the seed crystals that remain or nucleate after
amorphization and from where crystallization will proceed [59, 60]. Moreover, the
initial state for either operation also varies depending on the history of the cell. For
example, the initial crystalline resistance of the same cell is an important factor in
the result of the following operations and compounds the overall variability. In
the case of cell-to-cell programming variability, for both reset and set operations,
process variations also add to the intrinsic cell variability. Process variations include
geometry variations (thickness, length, or width of the active regions and contacts)
and local material variations.
Due to spontaneous resistance drift, PCM cells do not remain at the programmed
resistance levels, and the drift history of cells also adds to the variability in
subsequent programming cycles. The hexagonal close packed (hcp) phase is the stable
phase of GST and does not experience drift, but devices typically operate between
the metastable amorphous and crystalline face-centered cubic (fcc) phases, and both
experience resistance drift. The crystalline fcc state shows a slight upward
resistance drift within typical time scales, and over longer periods it slowly
transitions to the stable hcp phase with a decrease in resistance. The amorphous phase,
on the other hand, shows a significant, steady upward resistance drift at the beginning,
which saturates after a certain time (depending on temperature), after which the
resistance decreases until complete crystallization. The resistance drift trends
accelerate with temperature, causing faster data loss at higher temperatures [61–65]
(Figs. 3.13a and 3.14b). The upward resistance drift in the amorphous phase follows a
power-law behavior, and the slope of the bilogarithmic resistance–time plot is called the
drift exponent or drift coefficient [66] (Fig. 3.13b). A higher drift coefficient
indicates a faster increase of the amorphous resistance during the upward resistance
drift. The drift coefficient depends on the temperature [65–67], read current [66], and
the programmed resistance level [58]. Drift itself is a stochastic process in PCM
(Fig. 3.14a) and has been explained as related to structural relaxation of the amorphous
material after the rapid melt-quench process [66] and to charge trapping and detrapping
from incipient nuclei during early crystallization [67, 68] (Fig. 3.15).

Fig. 3.13 (a) Experimental results of resistance drift observed from GST-225 line cells at various
temperature levels. (b) Drift coefficient extracted from the bilogarithmic resistance versus time plot
measured from a line cell at 400 K [65, 67]
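
The power-law drift of the amorphous phase can be written as R(t) = R0 (t/t0)^ν, where ν is the drift coefficient, i.e., the slope of log R versus log t. The sketch below generates ideal power-law readings and recovers ν with a least-squares fit in log-log space. The symbols R0, t0, and ν, the function names, and the numerical values (R0 = 1e6 Ω, ν = 0.1) are illustrative assumptions, not measured data from the cited works.

```python
import math


def drift_resistance(r0_ohm, t_s, nu, t0_s=1.0):
    """Power-law drift: R(t) = R0 * (t / t0) ** nu."""
    return r0_ohm * (t_s / t0_s) ** nu


def fit_drift_coefficient(times_s, resistances_ohm):
    """Least-squares slope of log10(R) versus log10(t)."""
    xs = [math.log10(t) for t in times_s]
    ys = [math.log10(r) for r in resistances_ohm]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Illustrative (not measured) readings over three decades of time.
times = [1.0, 10.0, 100.0, 1000.0]
readings = [drift_resistance(1e6, t, nu=0.1) for t in times]
nu_fit = fit_drift_coefficient(times, readings)
```

On noise-free synthetic data the fit returns the ν used to generate it. On measured data the same fit applies only to the upward-drift regime, before the saturation and crystallization described above.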
PCM cells also experience read disturb, when a read operation with a higher-than-normal
current causes localized heating that perturbs the cell state, and thermal cross-talk
(program disturb), when programming neighboring cells raises the temperature enough to
disturb the state of the cell [69]. PCM devices also exhibit various types of noise and
fluctuations in electrical current measurements. Random telegraph noise (RTN) has
been observed at intermediate states (~600 kΩ resistance level of µ-trench cells [70])
when read with a certain voltage and at a certain ambient temperature. The RTN has
been ascribed to the same physical mechanism that causes the amorphous resistance
drift, i.e., structural relaxation. The random possibility for a defect to reside in one
of two equally energetically favorable states is reported as the possible mechanism
for RTN [70]. Moreover, current measurements at the crystalline and amorphous
states of PCM show random fluctuations of current, known as 1/f behavior or
flicker noise [71].

Fig. 3.14 (a) Experimental results of the spread of drift coefficients and its dependence on temperature.
The small scatter points are the drift coefficients measured on different cells, and the large scatter
points are the average of all values measured at the corresponding temperature. (b) Dependence of
crystallization time on temperature [65, 67]

Fig. 3.15 Schematic example of the random spatial arrangement of seed crystals nucleated inside
the amorphous plug, resulting in random resistance evolution over time during the crystallization
process in a PCM mushroom cell. Schematic redrawn from [60]

The programming variability, resistance drift, read disturb, thermal cross-talk,
and RTN in PCM devices cause reliability problems for memory implementations
but can be utilized for hardware security applications such as PUFs or true
random number generators (TRNGs).

3.3.3 PCM-Based PUF Reports

3.3.3.1 Concept-Only

Among the very few PCM-based PUFs reported, the first one was a concept-only
work, by Kursawe et al. in 2009 [34], with the introduction of RPUF (Reconfigurable
PUF) for the first time. According to the authors, the read process in PCM is much
more controlled than the writing process. Thus, the randomly programmed state in an
MLC scheme can generate the random multi-bit responses and can be reconfigured
as a new PUF, with a refreshed set of CRPs, by reprogramming the cell every time.
This work described the idea only, with no simulation or experimental validation.

3.3.3.2 Simulation

The next two works on PCM-based PUFs were published by Zhang et al. in
2013 and 2014. The first one, PCKGen (PCM cryptographic key generator), was
demonstrated with simulation [72], while the second one showed detailed experimental
validation of slightly different approaches [33]. For this PUF, the challenge
was the addresses of a pair of reconfigured PCM cells, and the response
was the comparison between the cell resistances. The first paper [72] was based on
three core points:
I. Due to the natural log-normal distribution of the PCM cell resistance, a
logarithmic amplifier (LogAmp) was added to the cell read path to reshape
the cell resistance distribution in the linear domain. This method removed the
undesired bias in the bit pattern of the cryptographic key output and maximized
the entropy.
II. An imprecisely controlled current-pulse regulator (ICCR) was used to generate
probabilistic current pulses to reconfigure the PCM cells. The ICCR incorporates
a current-mode digital-to-analog converter (DAC) and a pulse shaper.
These two circuits use m-bit and n-bit digital bit strings to control the
amplitude and the duration of the current pulses applied to the PCM cells,
respectively. A similar approach was also used in their following experimental
work [33], with voltage pulses instead of current pulses for programming.
III. A post-processing module (PPM) was used to improve the raw response quality.
The authors indicated that the raw responses might incorporate noise and
might not be truly random. The PPM helped the responses to be unpredictable
and stable over time. Error correction code (ECC) with helper data (stored in
nonvolatile PCM cells with well-maintained security) and a subsequent hash
function were used to improve the response quality.
A numerical PCM model [73] was used for the simulations performed in this work,
along with simulations of the auxiliary CMOS circuits in a Cadence 90 nm design
environment. The simulated security analysis showed improved bias with the
LogAmp, a reduced error rate with ECC, and ~50% uniqueness with the hash
function [72].
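
The post-processing step (ECC with helper data, followed by a hash) can be sketched with a code-offset construction. This is a generic textbook stand-in, not the PPM of [72]: the repetition code, the factor REP, the use of SHA-256, and all function names are assumptions chosen to keep the example short.

```python
import hashlib

REP = 5  # hypothetical repetition factor; a real design would use a stronger ECC


def encode(key_bits):
    """Repetition code: repeat every key bit REP times."""
    return [b for b in key_bits for _ in range(REP)]


def decode(code_bits):
    """Majority vote over each group of REP bits."""
    return [int(sum(code_bits[i:i + REP]) > REP // 2)
            for i in range(0, len(code_bits), REP)]


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def enroll(raw_response, key_bits):
    """Helper data = raw response XOR codeword (code-offset construction)."""
    return xor(raw_response, encode(key_bits))


def reproduce(noisy_response, helper):
    """Recover the key bits from a noisy re-read, then hash them."""
    key_bits = decode(xor(noisy_response, helper))
    digest = hashlib.sha256(bytes(key_bits)).hexdigest()
    return key_bits, digest


key = [1, 0, 1, 1]
raw = [0, 1, 1, 0, 1,  1, 1, 0, 0, 1,  0, 0, 0, 1, 1,  1, 0, 1, 0, 0]
helper = enroll(raw, key)

noisy = list(raw)
noisy[2] ^= 1    # one flipped bit in the first group
noisy[12] ^= 1   # one flipped bit in the third group
recovered, digest = reproduce(noisy, helper)
```

Here the noisy re-read still decodes to the enrolled key because each group of five bits contains at most one error, so the hashed output matches the one obtained from the clean response.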

3.3.3.3 Experimental Validation

In their experimental validation, 180-nm PCM cells were used with single and
repeated pulsing approaches, evaluated separately. The address of a PCM cell was the
challenge, and the resistance resulting from reconfiguration (reprogramming)
was the response for this PUF [33].
For the single pulse programming approach, the partial reset programming method
was used, which requires less programming time than the partial set operation.
Using staircase down (SCD) pulses, the entire PCM array was initialized to the
full set state to maintain similar initial conditions for all cells prior to
programming. The PCM cells were then partially reset with single rectangular pulses
of variable amplitude and duration (nonuniform programming scheme in
Fig. 3.16). The authors indicated that the programmed resistance of identical
PCM cells should depend only on the fabrication process variations and the applied
programming pulse parameters. However, this overlooks the inherent programming
variability originating from the nanoscale switching mechanism in PCM, which is
a key contributing factor to the cell-to-cell programming stochasticity, besides the
unavoidable process variations [33].
The random pulse parameters for the nonuniform programming were generated
using an imprecisely controlled regulator (ICR), as in their simulation-based work,
which was argued to be a more secure scheme against physical attacks as compared
to the conventional hash-mode and TRNG-mode methods with digital-to-analog
converters (DAC). The ICR circuit includes a programming voltage generator and a
pulse generator [33] (Fig. 3.17).
An m-bit binary input string can produce 2^m possible configuration states for the
programming voltage amplitudes, and an n-bit binary input string randomly selects
the delay chain from n sets of inverter delay chains; hence, 2^n possible pulse durations
can be generated [33].
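
The size of the configuration space follows directly from the two input strings. As a minimal sketch, assuming the configuration word is simply the m amplitude bits followed by the n duration bits (the bit order and the names split_config and config_count are hypothetical), the word can be decoded and counted as follows:

```python
def split_config(bits: str, m: int, n: int):
    """Split an (m+n)-bit string into amplitude and duration selector values.

    Assumes m >= 1 and n >= 1, with the amplitude bits first.
    """
    if len(bits) != m + n or set(bits) - {"0", "1"}:
        raise ValueError("expected a binary string of length m + n")
    amplitude_code = int(bits[:m], 2)   # one of 2**m amplitude settings
    duration_code = int(bits[m:], 2)    # one of 2**n duration settings
    return amplitude_code, duration_code


def config_count(m: int, n: int) -> int:
    """Number of distinct (amplitude, duration) configurations."""
    return (2 ** m) * (2 ** n)
```

For example, with m = 3 and n = 2 the word "10110" selects amplitude setting 5 and duration setting 2 out of 32 possible configurations.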

Fig. 3.16 Using the (a) uniform programming pulse parameters, Progs (A, t), for which the resistance
variation is due to (1) fabrication variations only, and (b) nonuniform programming pulse parameters,
Progs1 (A, t), Progs2 (A, t), for which the resistance variation is due to (1) fabrication variations and
(2) pulse parameters, for achieving spread of programmed resistance for PCM-based RPUF [33]

Fig. 3.17 (a, b) The conventional approaches of generating random configuration states with hash-
mode and TRNG-mode along with digital-to-analog converters (DAC). The configuration state S is
prone to physical attack in both cases. (c) Random configuration state generation using an imprecisely
controlled regulator (ICR) along with a DAC is reported to be more secure [33]

The large number of random challenges (2^m × 2^n) generated by concatenating
the two input strings from the ICR is applied to program or reconfigure the PCM
cells. Even if the challenge bits are revealed from a limited number of CRP
observations due to the linear behavior of the voltage and pulse generator circuits,
the inherent physical variability in PCM programming will not allow an attacker
to predict the response. The reliability test done at different temperatures showed
that the intra-Hamming distance measured at higher temperature is larger than the
desired small values. Thus, several ECCs were considered to improve uniformity.
The unpredictability test conducted on different cells of the array also showed a slight
skew (~30%), and the unclonability test performed on different chips showed ~40%
inter-Hamming distance. As the raw responses were the resulting programmed resistance
values, which depend on the physical uncertainty in the partial amorphization
mechanism in PCM cells, these skews are expected. The raw responses were
post-processed using a hash function in the fuzzy extractor to improve the
unpredictability and unclonability features of the PUF [33].
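
The two Hamming-distance metrics used in these tests can be computed as below. This is a generic sketch of the standard definitions (the intra-distance compares re-reads of one device, the inter-distance compares different devices); the function names and the example bit strings are made up for illustration and are not data from [33].

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of differing positions between two equal-length bit lists."""
    if len(a) != len(b):
        raise ValueError("responses must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def intra_hd(reference, rereads):
    """Mean distance between one device's reference response and its re-reads."""
    return sum(hamming_fraction(reference, r) for r in rereads) / len(rereads)


def inter_hd(responses_by_chip):
    """Mean pairwise distance between the responses of different chips."""
    pairs = list(combinations(responses_by_chip, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


# Made-up 8-bit responses to the same challenge.
chip_a = [1, 0, 1, 1, 0, 0, 1, 0]
chip_b = [0, 0, 1, 0, 1, 0, 1, 1]
chip_a_hot = [1, 0, 1, 0, 0, 0, 1, 0]   # hypothetical re-read at higher temperature
```

An intra-distance near 0 indicates a reliable response, while an inter-distance near 50% indicates that different chips are as distinguishable as possible.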
This design included not only a large number of CRPs but also the refreshability
feature of each PCM cell, which renders a new set of CRPs after each reprogramming.
Hence, it results in a PUF with strong unpredictability or mathematical unclonability
against model-building attacks [33]. However, the authors indicated that attacks
through exhaustive reading attempts between two consecutive reconfigurations
are still possible and need to be prevented with additional features. Two possible
measures were proposed to address this problem [33]:
i. Periodic refresh: the CRPs are refreshed periodically after every duration t_R.
A clock can be used to monitor the time elapsed.
ii. Frequency-based refresh: the CRPs are refreshed after a certain number of
evaluations, n_R, take place. A counter can be used to monitor the number of evaluations.
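
The two refresh measures above can be sketched as a small bookkeeping class. The class name, its methods, and the choice to pass the current time in explicitly are assumptions made for illustration; [33] describes only the clock and the counter.

```python
class RefreshPolicy:
    """Decide when a reconfigurable PUF should be reprogrammed."""

    def __init__(self, t_r_seconds=None, n_r_evaluations=None):
        self.t_r = t_r_seconds          # period for the periodic refresh
        self.n_r = n_r_evaluations      # evaluation budget for the frequency-based refresh
        self.last_refresh = 0.0
        self.evaluations = 0

    def record_evaluation(self):
        """Call once per challenge evaluation (the counter)."""
        self.evaluations += 1

    def due(self, now_seconds):
        """True when either enabled criterion has been reached."""
        periodic = (self.t_r is not None
                    and now_seconds - self.last_refresh >= self.t_r)
        by_count = self.n_r is not None and self.evaluations >= self.n_r
        return periodic or by_count

    def mark_refreshed(self, now_seconds):
        """Call after the cells have been reprogrammed."""
        self.last_refresh = now_seconds
        self.evaluations = 0
```

A controller would call due() before serving a challenge and reprogram the cells whenever it returns True. Either criterion can be disabled by leaving its parameter as None.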
In the same work, an alternative partial reset programming strategy with gradually
increasing repeated pulses (termed staircase up or SCU) was also used
along with a program-and-verify scheme. In this programming method, the cell resistance
was measured between consecutive programming pulses and compared
with a target programmed resistance (R_target) to determine whether additional pulses
were required. Owing to the manufacturing variations indicated by the authors and, we
add here, to the programming variability itself, the number of pulses
required to reach a certain R_target varied randomly. Moreover, the use of a varying
incremental voltage step and varying pulse durations introduced additional variability
to the responses. The challenge for the PUF was again the cell address, and the
raw response was the number of pulses required to reach a certain R_target. The raw
responses were post-processed to quantize the output into digital bits. The
error probability in the PUF output increased with the number of output
bits used for quantization. By increasing the programming time with the gradual
reset method, the total programming time was made equal to the time between two
consecutive reconfigurations in the single pulse programming method. Hence, the
number of accessible CRPs between two reconfigurations was only one, significantly
improving the security of the repeated pulse programming as compared to the single
pulse programming [33].
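
The program-and-verify loop can be illustrated with a toy simulation. Everything about the cell model here is an assumption made for the sketch (a multiplicative log-normal resistance step whose mean grows with the pulse index), as is the modulo-based quantization; neither is the device model or the quantizer of [33].

```python
import random


def scu_pulse_count(r_target_ohm, rng, r_start_ohm=1e3, max_pulses=64):
    """Count staircase-up pulses needed to reach r_target_ohm.

    Toy model (an assumption, not device physics): each pulse multiplies the
    resistance by a random factor whose mean grows with the pulse index.
    """
    resistance = r_start_ohm
    for pulse in range(1, max_pulses + 1):
        step_gain = 1.0 + 0.1 * pulse              # rising pulse amplitude
        resistance *= rng.lognormvariate(0.0, 0.3) * step_gain
        if resistance >= r_target_ohm:             # verify step (low-voltage read)
            return pulse
    return max_pulses


def quantize(pulse_count, bits=2):
    """Keep the lowest `bits` bits of the pulse count as the digital response."""
    return pulse_count % (2 ** bits)


rng = random.Random(1)   # seeded only to make the sketch repeatable
counts = [scu_pulse_count(1e5, rng) for _ in range(8)]
responses = [quantize(c) for c in counts]
```

In this toy model the pulse count differs from trial to trial because of the random factor, which plays the role of the combined manufacturing and programming variability described above.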

3.3.3.4 TRNG (Simulation Based)

A recent work has discussed the dependence of programming variability on the PCM
cell design and how this can be leveraged for hardware security purposes [74]. This
paper reports that since in mushroom cells or µ-trench cells the active region (volume
that is switched between amorphous and crystalline) is adjacent to the contact, the
switching mechanism is strongly dependent on the shape of the heater–chalcogenide
interface and on the heater material used [75, 76]. These cells were observed to have
smaller programming variability and the process variations (mostly on the heater
definition) are likely to dominate over the variability associated with the switching
location within the PCM material. In contrast, in line cells, the active region is within
the phase change material, away from the contacts, and the larger variability in these
cells is due mostly to the switching variations within the PCM material [77, 78]. The
authors indicated that the variability in PCM line cells can be even larger than that
observed for oxide-based resistive RAMs which are known for large programming
variability [79]. The authors have simulated the switching from the full reset state
using a precalibrated voltage for a large number of PCM line cells. The cells fail or
succeed to switch toward the crystalline resistance state with equal probability and
this random switching was proposed for TRNG implementations.
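
The proposed TRNG principle, one bit per cell depending on whether it switches, can be mimicked as below. The software pseudo-random generator is only a stand-in for the physical switching randomness, and the function names, the seed, and the simple ones-fraction check are assumptions for illustration; this is not the device simulation of [74].

```python
import random


def simulate_set_attempts(n_cells, rng, p_switch=0.5):
    """Return 1 for each cell that crystallizes and 0 for each that stays reset."""
    return [1 if rng.random() < p_switch else 0 for _ in range(n_cells)]


def ones_fraction(bits):
    """Fraction of ones; an unbiased source should give a value near 0.5."""
    return sum(bits) / len(bits)


rng = random.Random(2024)   # software PRNG used only as a stand-in for device physics
bits = simulate_set_attempts(10_000, rng)
fraction = ones_fraction(bits)
```

A ones fraction close to 0.5 is a necessary but not sufficient sign of good randomness; a real evaluation would run a full statistical test suite on bits harvested from hardware.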
Table 3.1 Comparison of various PUF properties for different technologies. Table adapted from [6]. The PCM-based PUF properties have been added here as
the last row

| PUF name | Randomness type | Challenge | Response | Tamper evident? | Unique? | Reproducible? | Physically unclonable? | Mathematically unclonable? | Unpredictable? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RFID-like protocol | Secret key is explicitly hard programmed during manufacture | Interrogation by a reader | Permanent secret key | – | Yes | Yes | No | No | No |
| Optical PUF | Explicitly introduced | Laser orientation | Gabor hash of speckle pattern | Yes | Yes | Yes | Yes | Yes | Yes |
| Coating PUF | Explicitly introduced | Sensor selection | Quantized capacitance | Yes | Yes | Yes | Yes | No (c) | Yes |
| Basic arbiter PUF | Implicit manufacturer variability | Delay line setting | Arbiter decision | – | Yes | Yes | Yes | No (d) | ! (f) |
| Feedforward arbiter PUF | Implicit manufacturer variability | Delay line setting | Arbiter decision | – | Yes | Yes | Yes | No (d) | ! (f) |
| RO PUF w/division | Implicit manufacturer variability | Delay line setting | Frequency division | – | Yes | Yes | Yes | No (d) | ! (f) |
| RO PUF w/comparator and 1-out-of-8 mask | Implicit manufacturer variability | Loop pair selection | Frequency comparison | – | Yes | Yes | Yes | No (c) | Yes |
| SRAM PUF | Implicit manufacturer variability | SRAM address | Power-up state | – | Yes | Yes | Yes | No (c) | Yes |
| Latch PUF | Implicit manufacturer variability | Latch selection | Settling state of destabilized latch | – | Yes | Yes | Yes | No (c) | Yes |
| Butterfly PUF | Implicit manufacturer variability | Cell selection | Settling state of destabilized cell | – | Yes | Yes | Yes | No (c) | Yes |
| PCM-PUF based on programming variability | Implicit manufacturer variability and programming variability | Cell selection | Programmed or not? | Yes (a) | Yes | Yes (b) | Yes | Yes (e) | Yes |

a Tamper evidence feature can be implemented using the read disturb property and thermal cross-talk of PCM (if read voltages are sufficiently high)
b For a PCM-PUF based on the variability of resistance of cells programmed to either state, the CRPs can be reproduced. The CRPs based on amorphous and
crystalline fcc states can, however, become noisy at elevated temperatures and also over time due to resistance drift (this does not affect the stable crystalline
hcp state). For the programming-variability-based PCM-PUF, the challenge is a programming pulse after an initializing crystallization procedure. It might not
be possible to obtain the same programming result on consecutive trials of the same programming pulse on the same PCM cell to check reproducibility
if a weak programming strategy is used to randomly pass or fail a cell to program. However, read operations performed before the next programming cycle will
retain the state due to nonvolatility and long retention time, and hence the reproducibility of the state during read can be ensured for a given programming
cycle. Moreover, in the programming-variability-based PCM-PUF, the CRPs can be refreshed by reprogramming the cell (periodically or after a certain number
of measurements) to prevent physical attacks, and this can be done a very large number of times, limited by cell endurance (>~10^10 cycles)
c By exhaustively reading out all CRPs
d By model-building attack
e PCM programming variability depends on complex stochastic switching mechanisms within the phase change material that cannot be predicted or modeled
exactly
f When an adversary learns more and more CRPs, it becomes increasingly easier to predict the unseen CRPs

3.3.4 PCM-PUF Versus Other PUFs

Table 3.1, adapted from the review by Maes and Verbauwhede in 2010 [6], compares
the main PUF properties for different technologies that have been considered. We
have added the security performance properties of PCM-based PUFs as the last row.
It is interesting to note that PCM and optical PUFs are the only types considered to
be mathematically unclonable, an important feature for high-security applications.
It is also important to emphasize here that PCM-based PUFs (and other emerging
NVM-based PUFs such as RRAM-PUFs or STT-MRAM-PUFs) are reconfigurable
PUFs, a property that also enables higher-security implementations.

3.4 Outlook

Phase change memory provides a new platform for hardware security primitives
such as PUFs and TRNGs. These applications are enabled through several unique
properties of PCM:
– programming variability due to complex mechanisms behind amorphization and
crystallization processes of phase change materials that cannot be reproduced,
predicted, or modeled exactly;
– high endurance devices that enable very large (practically unlimited) sets of
refreshed, nonvolatile challenge-response pairs;
– read disturb and thermal cross-talk effects that can potentially be used for tamper
evidence schemes;
– resistance drifts of the amorphous and crystalline fcc states that add to the intrinsic
programming variability in following cycles.
Although extensive research efforts have made PCM the well-established memory
technology it is today, most have focused on improving memory performance,
which includes minimizing variability, and only a few reports have discussed the
intentional use of variability for hardware security. Further research on PCM materials
and devices focusing on hardware security applications is therefore needed to
better understand and utilize variability in these devices. Suitable electrical
characterization and data analysis techniques are also needed to study variability in PCM
devices, as well as in other emerging nonvolatile nanoscale devices such as RRAM or
STT-MRAM, which also offer promising properties for reconfigurable hardware security
primitives.

Acknowledgements This work has been funded by the Air Force Office of Scientific Research
(AFOSR) through award FA9550-14-1-0351Z (MURI: Universal Security Theory for Evaluation
and Design of Nano-scale Devices and for Development of Innovative Security Primitives). The
authors would like to thank the members of the Nanoelectronics Laboratory at University of
Connecticut and of the MURI team, with special thanks to Fahim Rahman and Bicky Shakya from
University of Florida and Chenglu Jin and Phuong Ha from University of Connecticut for
valuable discussions on hardware security primitives, and Professor Ali Gokirmak from University of
Connecticut for his help in device physics understanding.

References

1. S. Sicari, A. Rizzardi, L.A. Grieco, A. Coen-Porisini, Security, privacy and trust in Internet
of Things: the road ahead. Comput. Networks 76, 146–164 (2015). https://doi.org/10.1016/J.COMNET.2014.11.008
2. D. Evans, The internet of things: how the next evolution of the internet is changing everything.
Cisco Internet Bus. Solut. Gr. 1(2011), 1–11 (2011)
3. H. Handschuh, G.-J. Schrijen, P. Tuyls, Hardware intrinsic security from physically unclonable
functions, in Towards Hardware-Intrinsic Security, ed. by A.-R. Sadeghi, D. Naccache
(Springer, Berlin Heidelberg, Germany, 2010), pp. 39–53
4. C. Herder, M.D. Yu, F. Koushanfar, S. Devadas, Physical unclonable functions and applications:
a tutorial. Proc. IEEE 102(8), 1126–1141 (2014). https://doi.org/10.1109/JPROC.2014.2320516
5. Q. Xiao, T. Gibbons, H. Lebrun, RFID technology, security vulnerabilities, and countermeasures,
in Supply Chain the Way to Flat Organisation, ed. by Y. Huo, F. Jia (InTech, Vienna,
Austria, 2009), p. 404
6. R. Maes, I. Verbauwhede, Physically unclonable functions: a study on the state of the art
and future research directions, in Towards Hardware-Intrinsic Security, 1st edn., ed. by A.-R.
Sadeghi, D. Naccache (Springer, Berlin Heidelberg, Germany, 2010), pp. 3–37
7. B.C. Grubel, B.T. Bosworth, M.R. Kossey, H. Sun, A.B. Cooper, M.A. Foster, A.C. Foster,
Silicon photonic physical unclonable function. Opt. Express 25(11), 12710 (2017).
https://doi.org/10.1364/OE.25.012710
8. S.N. Graybeal, P.B. Mcfate, Getting Out of the STARTing Block.
Sci. Am. 261(6), 61–67 (2017). https://doi.org/10.2307/24987511
9. R. Pappu, B. Recht, J. Taylor, N. Gershenfeld, Physical one-way functions. Science
297(5589), 2026–2030 (2002). https://doi.org/10.1126/science.1074376
10. P. Tuyls, B. Škorić, Strong authentication with physical unclonable functions, in Security,
Privacy, and Trust in Modern Data Management, ed. by M. Petković, W. Jonker (Springer,
Berlin, Heidelberg, Germany, 2007), pp. 133–148
11. U. Rührmair, C. Hilgers, S. Urban, A. Weiershäuser, E. Dinter, B. Forster, C. Jirauschek, Optical
PUFs reloaded, in eprint.iacr.org (2013)
12. D.W. Bauder, An anti-counterfeiting concept for currency systems, in Sandia Natl. Labs, Albu-
querque, NM, Tech. Rep. PTK-11990 (1983)
13. G. Hammouri, A. Dana, B. Sunar, CDs have fingerprints too, in Cryptographic Hardware
and Embedded Systems-CHES 2009, ed. by C. Clavier, K. Gaj (Springer, Berlin, Heidelberg,
Germany, 2009), pp. 348–362
14. G. DeJean, D. Kirovski, RF-DNA: radio-frequency certificates of authenticity, in Cryptographic
Hardware and Embedded Systems-CHES 2007, ed. by P. Paillier, I. Verbauwhede (Springer,
Berlin, Heidelberg, Germany, 2007), pp. 346–363
15. R. Indeck, M. Muller, Method and apparatus for fingerprinting magnetic media, US Patent No.
5,365,586 (1994)
16. S. Vrijaldenhoven, Acoustical physical uncloneable functions, M.S. thesis, Department of
Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, Nether-
lands, 2004. Available: https://pure.tue.nl/ws/files/46971492/600055-1.pdf. Accessed 21 Mar
2019
3 Phase Change Memory for Physical Unclonable Functions 87

17. K. Lofstrom, W. Daasch, D. Taylor, IC identification circuit using device mismatch, in 2000
IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.
00CH37056), 9 Feb 2000, San Francisco, CA, USA (Online). Available: IEEE Xplore, https://
ieeexplore.ieee.org/document/839821. Accessed 21 Mar 2019. https://doi.org/10.1109/ISSCC.
2000.839821
18. R. Helinski, D. Acharyya, J.P. Annual, A physical unclonable function defined using power
distribution system equivalent resistance variations, in Proceedings of the 46th Annual Design
Automation Conference. ACM, Jul 26–31 2009, San Francisco, CA, USA (Online). Available:
IEEE Xplore, https://ieeexplore.ieee.org/document/5227103. Accessed 21 Mar 2019. https://
doi.org/10.1145/1629911.1630089
19. P. Tuyls, G. Schrijen, B. Škorić, Read-proof hardware from protective coatings, in International
Workshop on Cryptographic Hardware and Embedded Systems, eds. by L. Goubin M. Matsui
(Springer, Berlin, Heidelberg, Germany, 2006)
20. J. Guajardo, B. Škorić, P. Tuyls, S.S. Kumar, T. Bel, A.H.M. Blom, G.-J. Schrijen, Anti-
counterfeiting, key distribution, and key storage in an ambient world via physical unclonable
functions. Inf. Syst. Front. 11(1), 19–41 (2009). https://doi.org/10.1007/s10796-008-9142-z
21. J. Lee, D. Lim, B. Gassend, G. Suh, M. van Dijk, S. Devadas, A technique to build a secret
key in integrated circuits for identification and authentication applications, in 2004 Symposium
on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No. 04CH37525), 17–19 Jun 2004,
Honolulu, HI, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/
1346548. Accessed 21 Mar 2019. https://doi.org/10.1109/VLSIC.2004.1346548
22. D. Lim, J.W. Lee, B. Gassend, G.E. Suh, M. van Dijk, S. Devadas, Extracting secret keys from
integrated circuits. IEEE Trans. Very Large Scale Integr. Syst. 13(10), 1200–1205 (2005).
https://doi.org/10.1109/TVLSI.2005.859470
23. B. Gassend, D. Clarke, M. van Dijk, S. Devadas, Silicon physical random functions, in Pro-
ceedings of the 9th ACM conference on Computer and communications security, 18–22 Nov
2002, Washington, DC, USA (Online). Available: ACM Digital Library, https://dl.acm.org/
citation.cfm?id=586132. Accessed 21 Mar 2019. https://doi.org/10.1145/586110.586132
24. B. Gassend, D. Lim, D. Clarke, M. van Dijk, S. Devadas, Identification and authentication of
integrated circuits. Concurr. Comput. Pract. Exp. 16(11), 1077–1098 (2004). https://doi.org/
10.1002/cpe.805
25. U. Rührmair, J. Sölter, F. Sehnke, On the foundations of physical unclonable functions, in IACR
Cryptology ePrint Archive (2009), p. 277
26. M. Majzoobi, F. Koushanfar, M. Potkonjak, Testing techniques for hardware security, in
IEEE International Test Conference (ITC), 28–30 Oct 2008, no. 31.3, Santa Clara, CA, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/4700636. Accessed 21
Mar 2019. https://doi.org/10.1109/TEST.2008.4700636
27. D.E. Holcomb, W.P. Burleson, K. Fu, Initial SRAM state as a fingerprint and source of true
random numbers for RFID tags, in Proceedings of the Conference on RFID Security, Graz,
Austria, vol. 7. no. 2, p. 1 (2007)
28. Y. Su, J. Holleman, B. Otis, A 1.6 pJ/bit 96% stable chip-ID generating circuit using pro-
cess variations, in IEEE International Solid-State Circuits Conference (ISSCC), 11–15 Feb
2007, San Francisco, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/
document/4242437. Accessed 21 Mar 2019
29. S.S. Kumar, J. Guajardo, R. Maes, G.J. Schrijen, P. Tuyls, The butterfly PUF protecting IP on
every FPGA, in 2008 IEEE International Workshop on Hardware-Oriented Security and Trust,
9 Jun 2008, Anaheim, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/
document/4559053. Accessed 21 Mar 2019. https://doi.org/10.1109/HST.2008.4559053
30. R. Maes, P. Tuyls, I. Verbauwhede, Intrinsic PUFs from flip-flops on reconfigurable devices,
in 3rd Benelux workshop on information and system security (WISSec 2008), Eindhoven,
Netherlands, vol. 17 (2008)
88 N. Noor and H. Silva

31. C. Helfmeier, C. Boit, D. Nedospasov, and J. P. Seifert, “Cloning physically unclonable


functions,” in 2013 IEEE International Symposium on Hardware-Oriented Security and
Trust (HOST), 2–3 Jun 2013, Austin, TX, USA (Online). Available: IEEE Xplore, https://
ieeexplore.ieee.org/document/6581556. Accessed 21 Mar 2019. https://doi.org/10.1109/HST.
2013.6581556
32. S. Katzenbeisser, Ü. Kocabaş, V. van der Leest, A.-R. Sadeghi, G.-J. Schrijen, C. Wachsmann,
Recyclable PUFs: logically reconfigurable PUFs. J. Cryptogr. Eng. 1(3), 177–186 (2011).
https://doi.org/10.1007/s13389-011-0016-9
33. L. Zhang, Z.H. Kong, C.H. Chang, A. Cabrini, G. Torelli, Exploiting process variations and
programming sensitivity of phase change memory for reconfigurable physical unclonable func-
tions. IEEE Trans. Inf. Forensics Secur. 9(6), 921–932 (2014). https://doi.org/10.1109/TIFS.
2014.2315743
34. K. Kursawe, A.R. Sadeghi, D. Schellekens, B. Škorić, and P. Tuyls, Reconfigurable physical
unclonable functions - enabling technology for tamper-resistant storage, in 2009 International
Workshop on Hardware-Oriented Security and Trust (HOST 2009), 27 Jul 2009, San Francisco,
CA, USA, (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/5225058.
Accessed 21 Mar 2019. https://doi.org/10.1109/HST.2009.5225058
35. R.R. Schaller, Moore’s law: past, present and future. IEEE Spectr. 34(6), 52–59 (1997). https://
doi.org/10.1109/6.591665
36. Y. Gao, D.C. Ranasinghe, S.F. Al-Sarawi, O. Kavehei, D. Abbott, Emerging physical unclon-
able functions with nanotechnology. IEEE Access 4, 61–80 (2016). https://doi.org/10.1109/
ACCESS.2015.2503432
37. R.S. Khan, N. Noor, A. Ciardullo, S. Muneer, L. Adnane, F. Dirisaglik, A. Cywar, C. Lam, Y.
Zhu, H. Silva, A. Gokirmak, A study on stochasticity in hexagonal close packed Ge2Sb2Te5
nanowires for possible physical unclonable function (PUF) implementation, in 2017 Material
Research Society (MRS) Spring Meeting & Exhibit, Phoenix, AZ, USA, 17–21 Apr 2017
38. R.S. Khan, N. Noor, A. Ciardullo, S. Muneer, L. Adnane, F. Dirisaglik, A. Cywar, C.
Lam, Y. Zhu, H. Silva, A. Gokirmak, A study on stochasticity in hexagonal close packed
Ge2Sb2Te5 nanowires, in 2016 International Semiconductor Device Research Symposium
(ISDRS), Bethesda, MD, USA, 7–9 Dec 2016
39. S.R. Ovshinsky, Reversible electrical switching phenomena in disordered structures. Phys. Rev.
Lett. 21(20), 1450–1453 (1968). https://doi.org/10.1103/PhysRevLett.21.1450
40. C.H. Sie, Electron microprobe analysis and radiometric microscopy of electric field induced
filament formation on the surface of AsTeGe glass. J. Non. Cryst. Solids 4, 548–553 (1970).
https://doi.org/10.1016/0022-3093(70)90092-X
41. E.J. Evans, J.H. Helbers, S.R. Ovshinsky, Reversible conductivity transformations in chalco-
genide alloy films, in Disordered Materials, ed. by D. Adler, B.B. Schwartz, M. Silver (Springer,
Boston, MA, USA, 1991), pp. 17–22
42. N. Yamada, E. Ohno, K. Nishiuchi, N. Akahira, Rapid-phase transitions of GeTe-Sb2Te3
pseudobinary amorphous thin films for an optical disk memory. J. Appl. Phys. 69(5), 2849–2856
(1991). https://doi.org/10.1063/1.348620
43. S.W. Fong, C.M. Neumann, H.S.P. Wong, Phase-change memory—towards a storage-class
memory. IEEE Trans. Electron Devices 64(11), 4374–4385 (2017). https://doi.org/10.1109/
TED.2017.2746342
44. R.F. Freitas, W.W. Wilcke, Storage-class memory: the next storage system technology. IBM J.
Res. Dev. 52(4/5), 439–447 (2008). https://doi.org/10.1147/rd.524.0439
45. Semiconductor Industry Association, International Technology Roadmap for Semiconductors
(ITRS)—Emerging Research Devices (2013)
46. Intel Optane Technology, Revolutionizing Memory and Storage (Online). Available: https://
www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.
html
47. G.W. Burr, M.J. Breitwisch, M. Franceschini, D. Garetto, K. Gopalakrishnan, B. Jackson, B.
Kurdi, C. Lam, L.A. Lastras, A. Padilla, Phase change memory technology. J. Vac. Sci. Technol.
B 28, 223 (2010). https://doi.org/10.1116/1.3301579
3 Phase Change Memory for Physical Unclonable Functions 89

48. H.P. Wong, S. Raoux, S. Kim, J. Liang, J.P. Reifenberg, B. Rajendran, M. Asheghi, K.E.
Goodson, Phase change memory. Proc. IEEE 98(12), 2201–2227 (2010)
49. D.H. Im, J.I. Lee, S.L. Cho, H.G. An, D.H. Kim, I.S. Kim, H. Park, D.H. Ahn, H. Horii, S.O.
Park, U.-I. Chung, J.T. Moon, A unified 7.5 nm dash-type confined cell for high performance
PRAM device, in 2008 IEEE International Electron Devices Meeting (IEDM), 15–17 Dec
2008, San Francisco, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/
document/4796654. Accessed 21 Mar 2019. https://doi.org/10.1109/IEDM.2008.4796654
50. J. Liang, R.G.D. Jeyasingh, H.Y. Chen, H.S.P. Wong, A 1.4 µA reset current phase change
memory cell with integrated carbon nanotube electrodes for cross-point memory applica-
tion, in 2011 Symposium on VLSI Technology - Digest of Technical Papers, 14–16 Jun 2011,
Honolulu, HI, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/
5984659. Accessed 21 Mar 2019
51. W. Kim, M. BrightSky, T. Masuda, N. Sosa, S. Kim, R. Bruce, F. Carta, G. Fraczak, H.Y. Cheng,
A. Ray, Y. Zhu, H.L. Lung, K. Suu, C. Lam, ALD-based confined PCM with a metallic liner
toward unlimited endurance, in 2016 IEEE International Electron Devices Meeting (IEDM),
3–7 Dec 2016, San Francisco, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.
ieee.org/document/7838343. Accessed 21 Mar 2019. https://doi.org/10.1109/IEDM.2016.
7838343
52. H.L. Lung, Y.H. Ho, Y. Zhu, W.C. Chien, S. Kim, W. Kim, H.Y. Cheng, A. Ray, M. Brightsky,
R. Bruce, C.W. Yeh, C. Lam, A novel low power phase change memory using inter-granular
switching, in 2016 IEEE Symposium on VLSI Technology, 14–16 Jun 2016, Honolulu, HI, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/7573405. Accessed 21
Mar 2019. https://doi.org/10.1109/VLSIT.2016.7573405
53. T. Nirschl et al., Write strategies for 2 and 4-bit multi-level phase-change memory, in 2007
IEEE International Electron Devices Meeting (IEDM), 10–12 Dec 2007, Washington, DC, USA
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/4418973. Accessed 21
Mar 2019. https://doi.org/10.1109/IEDM.2007.4418973
54. M. Stanisavljevic, A. Athmanathan, N. Papandreou, H. Pozidis, E. Eleftheriou, Phase-change
memory: Feasibility of reliable multilevel-cell storage and retention at elevated tempera-
tures, in 2015 IEEE International Reliability Physics Symposium (IRPS), 19–23 Apr 2015
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/7112747. Accessed 21
Mar 2019. https://doi.org/10.1109/IRPS.2015.7112747
55. N. Papandreou, H. Pozidis, A. Pantazi, A. Sebastian, M. Breitwisch, C. Lam, E. Elefthe-
riou, Programming algorithms for multilevel phase-change memory, in 2011 IEEE Interna-
tional Symposium of Circuits and Systems (ISCAS), 15–18 May 2011, Rio de Janeiro, Brazil
(Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/5937569. Accessed 21
Mar 2019. https://doi.org/10.1109/ISCAS.2011.5937569
56. N. Noor, S. Muneer, L. Adnane, R.S. Khan, R. Ramadan, F. Dirisaglik, A. Cywar, C. Lam,
Y. Zhu, A. Gokirmak, H. Silva, Pulse-mode electrical resistance trimming of Ge2Sb2Te5
phase change memory (PCM) line cells, in 2016 International Semiconductor Device Research
Symposium (ISDRS), Bethesda, MD, USA, 7–9 Dec 2016
57. N. Noor, S. Muneer, L. Adnane, R.S. Khan, A. Gorbenko, F. Dirisaglik, A. Cywar, C. Lam, Y.
Zhu, A. Gokirmak, H. Silva, Utilizing programming variability in phase change memory cells
for security, in 2017 Mater. Res. Soc. (MRS) Fall Meeting & Exhibit, Boston, MA, USA, 26
Nov–1 Dec 2017
58. M. Boniardi, D. Ielmini, S. Lavizzari, A.L. Lacaita, A. Redaelli, A. Pirovano, Statistics of resis-
tance drift due to structural relaxation in phase-change memory arrays. IEEE Trans. Electron
Devices 57(10), 2690–2696 (2010). https://doi.org/10.1109/TED.2010.2058771
59. U. Russo, D. Ielmini, A. Redaelli, A.L. Lacaita, Intrinsic data retention in nanoscaled phase-
change memories—Part I: Monte Carlo model for crystallization and percolation. IEEE Trans.
Electron Devices 53(12), 3032–3039 (2006). https://doi.org/10.1109/TED.2006.885527
60. B. Gleixner, A. Pirovano, J. Sarkar, F. Ottogalli, E. Tortorelli, M. Tosi, R. Bez, Data reten-
tion characterization of phase-change memory arrays, in 2007 IEEE International Reliability
90 N. Noor and H. Silva

Physics Symposium (IRPS), 15–19 Apr 2007, Phoenix, AZ, USA (Online). Available: IEEE
Xplore, https://ieeexplore.ieee.org/document/4227689. Accessed 21 Mar 2019
61. F. Dirisaglik, G. Bakan, S. Muneer, K. Cil, L. Sullivan, Z. Jurado, J. Rarey, L. Zhang, R.
Nowak, M. Akbulut, Y. Zhu, C. Lam, H. Silva, A. Gokirmak, High temperature electrical
characterization of phase change material: Ge2Sb2Te5, in 2013 Materials Research Society
(MRS) Fall Meeting & Exhibit, Boston, MA, USA, 1–6 Dec 2013
62. F. Dirisaglik, K. Cil, M. Wennberg, A. King, M. Akbulut, Y. Zhu, C. Lam, A. Gokirmak, H.
Silva, Crystalization times of Ge2Sb2Te5 nanostructures as a function of temperature,” in 2012
American Physical Society (APS) March Meeting, Boston, MA, USA, 27 Feb–2 Mar 2012
63. N. Noor, K. Cil, L. Sullivan, S. Muneer, F. Dirisaglik, A. Cywar, C. Lam, Y. Zhu, A. Gokir-
mak, H. Silva, An experimental study on waveform engineering for Ge2Sb2Te5 phase change
memory cells, in 2015 Materials Reserch Society (MRS) Fall Meeting & Exhibit, Boston, MA,
USA, 29 Nov–4 Dec 2015
64. N. Noor, R.S. Khan, S. Muneer, L. Adnane, R. Ramadan, F. Dirisaglik, A. Cywar, C. Lam, Y.
Zhu, A. Gokirmak, H. Silva, Short and long time resistance drift measurement in intermediate
states of Ge2Sb2Te5 phase change memory line cells, in 2017 Material Research Society (MRS)
Spring Meeting & Exhibit, Phoenix, AZ, USA, 17–21 Apr 2017
65. F. Dirisaglik, G. Bakan, Z. Jurado, S. Muneer, M. Akbulut, J. Rarey, L. Sullivan, M. Wennberg,
A. King, L. Zhang, R. Nowak, C. Lam, H. Silva, A. Gokirmak, High speed, high temper-
ature electrical characterization of phase change materials: metastable phases, crystallization
dynamics, and resistance drift. Nanoscale 7(40), 16625–16630 (2015). https://doi.org/10.1039/
C5NR05512A
66. D. Ielmini, D. Sharma, S. Lavizzari, A.L. Lacaita, Physical mechanism and temperature accel-
eration of relaxation effects in phase-change memory cells, in 2008 IEEE International Relia-
bility Physics Symposium (IRPS), 27 Apr–1 May 2008, Phoenix, AZ, USA (Online). Available:
IEEE Xplore, https://ieeexplore.ieee.org/document/4558952. Accessed 21 Mar 2019. https://
doi.org/10.1109/RELPHY.2008.4558952
67. F. Dirisaglik, High-temperature electrical characterization of Ge2Sb2Te5 phase change mem-
ory devices. Ph.D. dissertation, Department of Electrical & Computer Engineering, University
of Connecticut, Storrs, CT, USA, 2014. http://digitalcommons.uconn.edu/dissertations/577/.
Accessed 21 Mar 2019
68. R.S. Khan, N. Noor, C. Jin, J. Scoggin, Z. Woods, S. Muneer, A. Ciardullo, P.H. Nguyen,
A. Gokirmak, M. van Dijk, H. Silva, Phase change memory and its applications in hard-
ware security, in Security Oppotunities in Nano Devices and Emerging Technologies, 1st ed.,
M. Tehranipoor, D. Forte, G.S. Rose, S. Bhunia (CRC Press, Boca Raton, FL, USA, 2017),
pp. 93–114
69. A. Pirovano, A. Redaelli, F. Pellizzer, F. Ottogalli, M. Tosi, D. Ielmini, A.L. Lacaita, R. Bez,
Reliability study of phase-change nonvolatile memories. IEEE Trans. Device Mater. Reliab.
4(3), 422–427 (2004). https://doi.org/10.1109/TDMR.2004.836724
70. D. Fugazza, D. Ielmini, S. Lavizzari, A.L. Lacaita, Random telegraph signal noise in phase
change memory devices, in 2010 IEEE International Reliability Physics Symposium (IRPS),
2–6 May 2010, Anaheim, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.
org/document/5488741. Accessed 21 Mar 2019. https://doi.org/10.1109/IRPS.2010.5488741
71. G. Betti Beneventi, A. Calderoni, P. Fantini, L. Larcher, P. Pavan, Analytical model for low-
frequency noise in amorphous chalcogenide-based phase-change memory devices. J. Appl.
Phys. 106(5), 1–8 (2009). https://doi.org/10.1063/1.3160332
72. L. Zhang, Z.H. Kong, C.H. Chang, PCKGen: a phase change memory based cryptographic key
generator, in 2013 IEEE International Symposium on Circuits and Systems (ISCAS), 19–23 May
2013, Beijing, China (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/document/
6572128. Accessed 21 Mar 2019. https://doi.org/10.1109/ISCAS.2013.6572128
73. D.-H. Kang, D.-H. Ahn, K.-B. Kim, J.F. Webb, K.-W. Yi, One-dimensional heat conduction
model for an electrical phase change random access memory device with an 8F2 memory cell
(F=0.15 µm). J. Appl. Phys. 94(5), 3536–3542 (2003). https://doi.org/10.1063/1.1598272
3 Phase Change Memory for Physical Unclonable Functions 91

74. E. Piccinini, R. Brunetti, M. Rudan, Self-Heating Phase-Change Memory-Array Demonstrator


for True Random Number Generation. IEEE Trans. Electron Devices 64(5), 2185–2192 (2017).
https://doi.org/10.1109/TED.2017.2673867
75. S. Braga, A. Cabrini, G. Torelli, Theoretical analysis of the RESET operation in phase-change
memories. Semicond. Sci. Technol. 24(11), 115008 (2009). https://doi.org/10.1088/0268-1242/
24/11/115008
76. S.-Y. Lee, S.-M. Yoon, Y.-S. Park, B.-G. Yu, S.-H. Kim, S.-H. Lee, Low power and high speed
phase-change memory devices with silicon-germanium heating layers. J. Vac. Sci. Technol. B
Microelectron. Nanom. Struct. 25(4), 1244 (2007). https://doi.org/10.1116/1.2752515
77. F. Xiong, M.-H. Bae, Y. Dai, A.D. Liao, A. Behnam, E.A. Carrion, S. Hong, D. Ielmini,
E. Pop, Self-aligned nanotube-nanowire phase change memory. Nano Lett. 13(2), 464–469
(2013). https://doi.org/10.1021/nl3038097
78. K. Attenborough, G.A.M. Hurkx, R. Delhougne, J. Perez, M.T. Wang, T.C. Ong, L. Tran, D.
Roy, D.J. Gravesteijn, M.J. van Duuren, Phase change memory line concept for embedded
memory applications, in 2010 International Electron Devices Meeting (IEDM), 6–8 Dec 2010,
San Francisco, CA, USA (Online). Available: IEEE Xplore, https://ieeexplore.ieee.org/abstract/
document/5703442. Accessed 21 Mar 2019. https://doi.org/10.1109/IEDM.2010.5703442
79. A. Chen, Utilizing the variability of resistive random access memory to implement recon-
figurable physical unclonable functions. IEEE Electron Device Lett. 36(2), 138–140 (2014).
https://doi.org/10.1109/LED.2014.2385870
Chapter 4
Applications of Resistive Switching
Memory as Hardware Security Primitive

Roberto Carboni and Daniele Ielmini

Abstract With the widespread diffusion of ubiquitous mobile computing and the internet
of things (IoT), secure communication and chip authentication become key
requirements. Hardware-based security concepts generally provide the best performance,
combining a high security standard, low power consumption, and high area
density. In these concepts, the stochastic properties of the device, such as the phys-
ical and geometrical variations of the process, are harnessed to generate random
bits and functions. This is the basis for the true-random number generator (TRNG),
where true-random numbers are generated by exploiting the physics and randomness
of nanoscale devices. The same random variations can also be used to implement a
physical unclonable function (PUF) for the authentication of individual hardware
chips. Emerging memory devices rely on unique physical mechanisms for transport
and switching, and thus appear as an ideal source of entropy for hardware TRNG and
PUF. These novel memory concepts include resistive switching memory (RRAM),
phase change memory (PCM), and spin-transfer torque magnetic memory (STT-
MRAM) devices. As these devices are increasingly adopted as memory and comput-
ing elements in several applications, exploiting their intrinsic stochastic variations
for TRNG and PUF becomes an attractive solution for low-cost, high-performance
security primitives. This chapter provides an overview of TRNG and PUF adopting
emerging memory devices as the fundamental entropy source. TRNG concepts are
classified by the microscopic stochastic variation that is adopted as entropy source,
namely, current noise, switching delay time, or switching voltage. While most TRNG
concepts rely on RRAM devices, we also show novel concepts using STT-MRAM
devices which take advantage of the excellent endurance and speed of switching.
The TRNG schemes are discussed in terms of the simplicity of the design, e.g., the
ability to generate random bits without probability tracking by adopting a differential
circuit scheme. Finally, the status of PUF implementations using RRAM and
their array circuits is presented and discussed.

R. Carboni · D. Ielmini (✉)
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano,
Piazza L. da Vinci 32, 20133 Milano, Italy

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_4

4.1 Introduction

Information security has been a topic of intense research since the mid-1970s,
when the main purpose was to guarantee the confidentiality and integrity of data
within mainframe computers [1]. In more recent times, as mobile computers, inter-
net of things (IoT), and cloud computing are becoming ubiquitous, there is an ever-
increasing need for secure communication among them [2]. Portable devices such as
smartphones and tablets can now enable financial transactions and act as the primary
authentication token for the user. Therefore, there is a need for electronic chips to
(1) securely authenticate and be authenticated by other parties, (2) securely handle
private/sensitive information, and (3) operate in an untrusted environment where the
adversary might have physical access to the system [3]. These tasks must be imple-
mented in mobile devices at the level of integrated circuits (IC), featuring at the
same time both low power consumption and small area occupation. For application
in large-scale IoT [4] and cyber-physical systems (CPS) [5], security methodologies
must also feature high speed, low cost, and robustness to physical and side-channel
attacks [6].
Hardware-intrinsic security primitives such as true-random number generators
(TRNG) and physical unclonable functions (PUF) are gaining interest as
low-cost and high-performance security tools [7]. On the one hand, TRNG can conveniently
and efficiently generate the random bitstreams required by most cryptographic
and security applications [8, 9]. On the other hand, PUF can securely store
a secret key in the random characteristics of an IC, e.g., by exploiting the random
process fluctuations, thus enabling fast and low-cost authentication and secure key
storage [3]. Nanodevices are currently considered as the most promising approach
for TRNG and PUF thanks to the small area, the low power consumption, the scal-
ability, the 3D integration, and the ability to offer intrinsic stochastic phenomena
via the inherent physical transport and switching mechanisms. These properties are
all extremely beneficial for portable and IoT applications. Nanoelectronics can pro-
vide scalable device concepts via either the well-established complementary metal
oxide semiconductor (CMOS) technology, or via alternative memory concepts based
on resistive, phase change, magnetic and ferroelectric materials [10], sometimes
referred to as the “memristive” concepts [11]. CMOS-based TRNG [8] and PUF
[12] were first introduced thanks to the strong integration capabilities and techno-
logical maturity. Nevertheless, they soon demonstrated a limited entropy quality and
the need for increased area and power overhead to improve randomness [13]. On
the other hand, memristive devices are currently gaining increasing interest for hardware
security thanks to their intrinsic stochastic behavior, which can be harnessed for
high-performance, low-cost, low-energy on-chip entropy sources.
This chapter provides an overview of the current state-of-the-art for both TRNG
and PUF implementations with resistive (memristive) switching devices. The focus
is on the applications of emerging memory technologies such as resistive switching
memory (RRAM) and spin-transfer torque magnetic memory (STT-MRAM), combining
binary stochastic switching, good endurance and scalable device area. The
chapter is organized as follows: Sect. 4.2 describes the general framework of hardware
security primitives such as TRNG and strong PUF. Section 4.3 is a short overview of
the RRAM device, including the device structure and the switching mechanisms. Sections
4.4–4.8 present the possible approaches toward RRAM-based TRNG, based
on the stochastic phenomena in RRAM devices such as current noise (Sect. 4.4),
switching time variability (Sect. 4.5) and switching voltage variability (Sect. 4.6).
Section 4.7 presents TRNG schemes based on differential pairs of RRAM, while
Sect. 4.8 illustrates STT-MRAM-based TRNG concepts. Section 4.9 reviews recently
presented PUF concepts based on the RRAM technology. Finally, Sect. 4.10 provides
a summary and an outlook for the research and development of hardware security
using RRAM devices.

4.2 Hardware Security Primitives

4.2.1 PRNG and TRNG

Security of internet-based data transmission usually requires the generation of random
keys [8, 9] via an on-chip random number generator (RNG). In the era of IoT,
the need for compact and reliable RNG circuits with high entropy and high throughput
has considerably increased [14]. Other applications of RNG include the
emerging computing paradigms, such as stochastic [15–17] and brain-inspired neuromorphic
computing [18, 19], which inherently rely on large streams of random analog/digital
signals for their operation. Within this scenario (Fig. 4.1), RNG circuits
providing reliable random numbers with small circuit area, low energy consumption,
and high throughput become essential.
A classical method for generating random bits is the pseudo-random number
generator (PRNG) which can generate a random-looking bitstream according to
a deterministic algorithm initialized by a seed (e.g., interrupt events, kernel calls,
incoming TCP/IP requests) [8, 20]. For example, a linear-feedback shift register
(LFSR) is a digital circuit that, after being initialized with a seed, can generate a
deterministic sequence of pseudo-random numbers [21]. However, LFSR is a finite-
state machine, hence its output will be periodic, namely nonrandom over a sufficiently
large time scale. Also, the seed might be manipulated since it is derived from user
activity, or the knowledge of internal state and feedback tap structure of the LFSR
might allow operation monitoring. These limitations arise from the fact, already
recognized by Von Neumann [22], that random numbers cannot originate from a
deterministic, arithmetical algorithm. These are all critical issues that expose the PRNG
output to cryptanalysis [8, 23]. Due to the limited randomness and the high
vulnerability, these systems are generally unsuitable for integration in IoT devices
[24].
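The periodicity of an LFSR can be made concrete with a short sketch. The 16-bit register below, its tap positions (16, 14, 13, 11, a commonly tabulated maximal-length choice) and the seed value are illustrative assumptions and are not taken from the cited works; the function names are hypothetical.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one clock (assumed taps: 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_bits(seed, n_bits):
    """Return n_bits output bits; the stream is fully determined by the seed."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero (the all-zero state is a fixed point)")
    out = []
    for _ in range(n_bits):
        out.append(state & 1)
        state = lfsr16_step(state)
    return out


def lfsr16_period(seed):
    """Count the clocks needed for the register to return to its seed state."""
    start = state = seed & 0xFFFF
    if start == 0:
        raise ValueError("seed must be non-zero")
    period = 0
    while True:
        state = lfsr16_step(state)
        period += 1
        if state == start:
            return period


if __name__ == "__main__":
    seed = 0xACE1  # arbitrary example seed
    print("first 16 bits:", lfsr16_bits(seed, 16))
    print("period:", lfsr16_period(seed))
```

Because the register has only 2^16 − 1 non-zero states, the sequence must repeat, and anyone who learns the state and the taps can reproduce every future bit, which is the weakness the text describes.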
The data protection against cyberattacks can be improved with the true RNG
(TRNG), where the output bitstream is obtained via an inherently stochastic physical
process [25]. It has been demonstrated that the high unpredictability of hardware-based
TRNGs makes them more reliable than software-based PRNG systems [26, 27].
In recent years, various physical entropy sources were proposed for
TRNG, like the random telegraph noise (RTN) in dielectrics [28, 29], stochastic
quantum processes [25], stochastic spintronic phenomena [30, 31], and memristive
transport and switching [32–34]. Several stochastic entropy sources were identified
in both CMOS technologies and emerging memristive devices.
CMOS-based TRNGs were demonstrated by exploiting the noise in scaled MOS-
FET [28], the metastability at turn-on of cross-coupled inverter pair (namely, SRAM
core) [8], or the increased noise of dual drain MOSFET driving a voltage-controlled
oscillator (VCO) [35]. All these concepts take advantage of the mature integration
capability of CMOS logic chips. However, CMOS-type TRNGs suffer from vari-
ous drawbacks: for instance, a colored noise spectrum, e.g., due to capture/emission
events originating 1/f noise, results in a biased output bitstream, requiring consid-
erable post-processing and a consequent circuit overhead. Noise in CMOS devices
also critically depends on environmental/process fluctuations, whose impact can be
minimized only with entropy-tracking feedback loops [8], thus resulting in additional
power consumption, circuit area and added complexity. On the other hand, memristive
concepts such as RRAM and STT-MRAM enable ultra-small entropy sources
with high-quality randomness, which makes these technologies very promising for
TRNG.

4.2.2 Strong and Weak PUF

Secret information transmission with classical mathematical cryptography has relied
on sufficiently hard-to-break algorithms (i.e., the “lock”) and secret keys since its
inception [36]. Typically, the secret key is stored in a nonvolatile electrically erasable
programmable read-only memory (EEPROM) or a battery-backed static random-access memory
(SRAM), which results in relatively large area occupation and power consumption.
For low-power IoT devices, storing secret keys at low-energy consumption in a secure
way is becoming an increasingly difficult task [37], especially considering emerging
attack techniques such as side-channel attacks [38]. This has led to intense
research interest in hardware-intrinsic security primitives that do not require secret
key storage in digital memory.
In this scenario, the physical unclonable function (PUF) is a promising solution.
A PUF is a physical system that statistically maps an input challenge to an output
response through a secret key controlled by a stochastic property of the chip, e.g., the
silicon process variations or the intrinsic physical variability of device parameters
[3]. In general, PUF security is guaranteed by the extreme difficulty of accessing the
physical features of the hardware, and by the negligible probability that two chips
are manufactured with the same or similar set of parameters. These properties make
PUF an excellent scheme to uniquely identify a component or a circuit, thus enabling
hardware authentication and preventing IC counterfeiting [12].
Fig. 4.1 Schematic representation of the current information security scenario with some applica-
tions of hardware security primitives, comprising data cryptography, hardware (HW) authentication
and stochastic/neuromorphic computing. The main building blocks are the random number gen-
erator (RNG) and the physical unclonable function (PUF). The RNG can be implemented either
as a pseudo-random number generator (PRNG), such as the linear-feedback shift register (LFSR)
in the figure, or as a true-random number generator (TRNG), which harnesses the stochasticity of
physical phenomena (like the noise in the current trace in the figure) to generate a random bitstream.
A PUF system introduces a challenge (c) response (r) relation, where r = f(c) and f(·) is given by
the physical details defining that specific PUF instance. In the figure, a typical SRAM-based PUF
circuit is schematically shown. Adapted from [32]

A PUF system can be represented as a black box that for each input challenge c
returns an output response r = f(c), with f describing the unique internal physical
characteristics of the PUF (Fig. 4.1). The set of possible challenge-response pairs
(CRPs) defines a particular PUF instance.
Depending on the number of unique CRPs, there are two main categories of PUFs:
the weak PUF, which can only support a relatively small number of challenges, and
the strong PUF, with an extremely large set of CRPs [3]. More specifically, in a
weak PUF the number of CRPs increases linearly or polynomially with the number
of basic cells, i.e., the building blocks forming the PUF system [39], while the
number of CRPs increases exponentially in a strong PUF [12]. The weak PUF is often
referred to as physical obfuscated key (POK), since its primary task is the generation
or storage of a cryptographic key [40, 41]. The most popular implementation of
the PUF circuit is based on the digital static random-access memory (SRAM), and
exploits the metastable states of cross-coupled inverters [42]. In each inverter pair, the
response bit is determined by which of the two nominally equal-sized inverters of the
memory cell addressed by the challenge reaches the tri-state point faster. Memory-based
PUFs (POKs) are relatively easy to design, with low area overhead. Such
memory-based systems are essentially weak PUFs since the set of CRPs is limited
by the available memory capacity [43]. As a result, their CRP set can be completely
explored within polynomial time, compromising their use as identification tools. On
the other hand, given their large CRP sets, strong PUFs are practically immune to
brute-force attacks [12] and are therefore suitable for low-cost authentication.
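The difference between linear and exponential CRP scaling can be illustrated with a toy model. The sketch below treats a memory-based weak PUF as a table of random cell mismatches addressed by the challenge; the Gaussian mismatch, the class and function names, and the readout rate are assumptions made only for illustration and do not describe any specific circuit from the cited works.

```python
import random


class SramLikeWeakPuf:
    """Toy weak PUF: one response bit per memory cell, fixed by a random mismatch."""

    def __init__(self, n_cells, seed):
        rng = random.Random(seed)  # stands in for uncontrolled process variation
        self._mismatch = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    def response(self, challenge):
        """Challenge = cell address; response = which inverter 'wins' (sign of mismatch)."""
        return 1 if self._mismatch[challenge] > 0 else 0

    def crp_count(self):
        """The number of CRPs grows linearly with the number of cells."""
        return len(self._mismatch)


def strong_puf_crp_count(n_challenge_bits):
    """CRP count of an idealized strong PUF with an n-bit challenge (exponential)."""
    return 2 ** n_challenge_bits


def exhaustive_readout_seconds(n_crps, crps_per_second):
    """Time an attacker would need to read every CRP at an assumed readout rate."""
    return n_crps / crps_per_second


if __name__ == "__main__":
    weak = SramLikeWeakPuf(n_cells=64, seed=1)
    rate = 1e6  # assumed readout rate, CRPs per second
    print("weak PUF CRPs:", weak.crp_count(),
          "-> readout time [s]:", exhaustive_readout_seconds(weak.crp_count(), rate))
    print("strong PUF CRPs:", strong_puf_crp_count(64),
          "-> readout time [s]:", exhaustive_readout_seconds(strong_puf_crp_count(64), rate))
```

With the same number of basic elements, the weak PUF can be read out completely almost at once, whereas exhaustive enumeration of the strong PUF is impractical; this is the brute-force argument made in the text, and it says nothing about modeling attacks.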
Although there is no general metric to certify a PUF system in terms of security
properties, the following characteristics can be considered as the best figures of merit
(FOM) for PUF [12]:
• Reliability: A PUF should always give the same response to a given challenge over
a wide range of operating conditions (voltage, temperature, etc.).
• Unpredictability: The PUF response to an arbitrary challenge should not be predictable
from the CRPs of another PUF or from the previous CRPs of the same
PUF.
• Unclonability: The CRP mapping of a PUF cannot be physically or mathematically
cloned, even for the original manufacturer of the PUF.
• Physical Unbreakability: Any physical attempt to maliciously modify the PUF
should result in a malfunction or permanent damage of the chip.
The practical evaluation of these FOMs for a specific PUF system is not straightfor-
ward in general, as discussed in Sect. 4.9. Although extremely promising for low-cost
chip authentication, the PUF should be robust against attacks aiming at building
a model of the PUF. Attacks of this kind try to develop a model of a PUF instance
by observing a subset of its input–output pairs. Among these, machine-learning
attacks have been demonstrated to be particularly successful [44, 45]. The careful
co-design of the stochastic memory and the circuit-dependent function is therefore
essential for developing strong PUFs for hardware security.
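As a rough illustration of how two of the figures of merit listed above might be quantified, the sketch below uses Hamming distances between response bit strings: the intra-chip distance as a proxy for reliability and the inter-chip distance as a proxy for uniqueness. These proxies are common practice but are an assumption here, since the chapter does not define specific metrics; all function names are hypothetical.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions in which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def intra_chip_error(reference, repeated_reads):
    """Mean distance between a reference response and repeated reads of the same chip.

    Values close to 0 indicate a reliable PUF.
    """
    return sum(hamming_fraction(reference, r) for r in repeated_reads) / len(repeated_reads)


def inter_chip_distance(responses):
    """Mean pairwise distance between responses of different chips to one challenge set.

    Values close to 0.5 indicate that chips are hard to tell apart from random guesses.
    """
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    ref = [0, 1, 1, 0, 1, 0, 0, 1]
    reads = [[0, 1, 1, 0, 1, 0, 0, 1], [0, 1, 1, 0, 1, 0, 1, 1]]
    chips = [ref, [1, 1, 0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0, 1, 1]]
    print("intra-chip error:", intra_chip_error(ref, reads))
    print("inter-chip distance:", inter_chip_distance(chips))
```

Neither quantity captures resistance to modeling attacks, which requires evaluating how well a learned model predicts unseen CRPs.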

4.3 RRAM Devices for Hardware Security

Among the emerging memory technologies, RRAM is one of the most promising due
to its nonvolatile retention, fast switching, low power, and CMOS compatibility [46–49].
The RRAM integration in crosspoint arrays, in the back-end-of-line (BEOL),
and adopting 3D structures allows for increased density and ease of integration [50–53].
Figure 4.2a shows a bipolar RRAM device, comprising a HfO2 switching layer
sandwiched between a TiN bottom electrode (BE) and Ti top electrode (TE). The Ti
layer at the TE acts as an oxygen exchange layer inducing the generation of oxygen
vacancies in the oxide layer, which thus becomes HfOx with x < 2 [34]. RRAM devices
are usually integrated into a one-transistor/one-resistor (1T1R) structure to enable the
control of the resistance level by limiting the current flowing in the select transistor
during the set transition. Figure 4.2b shows the current–voltage (I–V) characteristics
of the RRAM device, where the application of a positive voltage to the TE causes a
set transition from the high-resistance state (HRS) to the low-resistance state (LRS)
at the set voltage Vset. The application of a negative voltage to
the TE instead induces a reset transition from the LRS to the HRS at
the reset voltage Vreset. The resistance window between the LRS and the HRS is
at least one order of magnitude, but can reach 5 orders of magnitude by the adoption
of high band gap dielectrics such as SiOx [54]. A relatively small gate voltage is
applied during the set transition to limit the compliance current IC across the device,
thus allowing control of the LRS resistance according to R = VC/IC, where VC is
a characteristic voltage generally lower than 1 V [55]. The LRS resistance can
thus be controlled by the parameter IC, while the HRS resistance can be controlled by
Fig. 4.2 Schematic of a typical 1T1R structure comprising a RRAM cell integrated on top of the
drain of an integrated MOSFET (a). In this example, the RRAM stack includes a Si-doped HfOx
switching layer, a Ti top electrode (TE) and a TiN bottom electrode (BE). The corresponding I–V
characteristics (b) show the definition of the set voltage Vset, the compliance current IC, the reset
voltage Vreset, and the stop voltage Vstop. Reprinted with permission from [34]. Copyright (2016)
IEEE

the maximum negative voltage along the reset sweep, namely the stop voltage Vstop
[54].
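The relation R = VC/IC can be turned into a quick numerical check. In the sketch below, the value VC = 0.5 V and the list of compliance currents are assumptions chosen only to illustrate the scaling (the text states only that VC is generally below 1 V); the function name is hypothetical.

```python
def lrs_resistance(v_c, i_c):
    """LRS resistance in ohms from R = VC / IC (VC in volts, IC in amperes)."""
    if i_c <= 0:
        raise ValueError("compliance current must be positive")
    return v_c / i_c


if __name__ == "__main__":
    v_c = 0.5  # assumed characteristic voltage in volts (illustrative only)
    for i_c_ua in (10, 50, 100, 500):
        r = lrs_resistance(v_c, i_c_ua * 1e-6)
        print(f"IC = {i_c_ua:4d} uA  ->  R_LRS = {r / 1e3:6.1f} kOhm")
```

Raising the compliance current by a given factor lowers the programmed LRS resistance by the same factor, which is why the gate voltage of the select transistor acts as the control knob for the LRS.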
RRAM switching is caused by ionic migration induced by the voltage and the local
Joule heating [56]. Because of the atomistic nature of the switching and local impact
of microstructure, such as crystalline grain boundary and orientation, the set and
reset transitions are characterized by a significant random variation [57]. The local
conduction path does not only change during set and reset operations, but is also prone
to stochastic atomistic fluctuations such as defect relaxation and diffusion which can
cause a significant variation in the read current after the programming pulse [58, 59].
RRAM variations thus affect both the device-to-device consistency within a memory
array [60], and the cycle-to-cycle variations within the same device because of the
several different defect configurations. Variations can affect all RRAM parameters,
including the LRS and HRS resistance, the set voltage Vset and the reset voltage
Vreset. While stochastic variations are critical in hindering memory and computing
applications of RRAM [60], they offer the physical source of entropy that is needed
for hardware security primitives.
Stochastic phenomena in RRAM devices can be exploited as an entropy source for
TRNG. RRAM schemes for TRNG can be grouped in three classes according to
Fig. 4.3, where the sources of entropy are (a) stochastic noise, (b) stochastic switching
time, or (c) stochastic switching voltage [10].

4.4 TRNG Based on Stochastic Noise

The fluctuation of a bistable defect within the RRAM conduction path in either
the LRS or HRS can lead to a relatively large fluctuation of the current between
two levels called random telegraph noise (RTN, see Fig. 4.3a) [61]. RTN induces a
random change in the read current between a low value I0 and a high value I1. By
sampling the current trace in Fig. 4.3a, one obtains the bimodal distribution of currents
in Fig. 4.3b, which can be used to assign the random bits “0” and “1”.
The current fluctuation in RTN can be ascribed to the modification of the charge
state of a bistable defect close to the conductive path, due to, e.g., electron trapping
and detrapping combined with a structural relaxation of the defect. The charge state
affects the carrier concentration close to the defect, thus resulting in a macroscopic
change of the measured current [58]. As the filamentary path diameter of the LRS
becomes smaller, the impact of the individual defect increases markedly, which is
generally evidenced by the difference between the two resistance values ΔR increasing
with the square of the average resistance (ΔR ∼ R²) [58, 61]. This is similar to
the RTN affecting the channel current in a MOS transistor, resulting from a bistable
fluctuation of the charge state of an oxide defect. As the RTN amplitude can be quite
significant, it can be exploited as an entropy source in RRAM devices.

Fig. 4.3 Random telegraph noise current fluctuations (a) and corresponding probability density
function (PDF) (b). In (c) the applied voltage pulse and its corresponding current response
evidencing the random delay time Δt, and PDF of Δt (d) with an equally spaced time window to
uniformly attribute bit values 0 and 1. Measured I–V curves evidencing cycle-to-cycle variation
of Vset (e), and PDF of the resistance measured after a stochastic set (f), where sub-distributions
of the high-resistance state and the low-resistance state are attributed to bits 0 and 1, respectively.
Reprinted with permission from Macmillan Publishers Ltd: Nature Electronics [10]. Copyright
(2018)

Fig. 4.4 a Measured I–V characteristics for negative voltage showing RTN. b Measured read
current as a function of time for read voltage Vread = 50, 200 and 350 mV and c corresponding
simulations. The RTN switching times ΔtON and ΔtOFF decrease with Vread. Reprinted with
permission from [58]. Copyright (2014) IEEE

Fig. 4.5 a Schematic representation of the TRNG block diagram, including CRRAM, comparator
and clock control circuit. b Comparator output, showing binary random digital behavior. Reprinted
with permission from [29]. Copyright (2012) IEEE

To understand the impact of RTN on device behavior, Fig. 4.4a shows the measured
current–voltage (I–V) characteristics for a RRAM device with HfOx switching
layer. The current trace was measured at negative voltage in the LRS and
clearly evidences discrete transitions, typical of RTN. Data show that the RTN transition
rate increases at higher Vread, which can be better understood from constant-voltage
measurements of current as a function of time in Fig. 4.4b. Here, the average rate of
switching between the two RTN states increases with the read voltage for Vread = 50,
200 and 350 mV. Correspondingly, the average time for which the current remains high
(ΔtON) and the time for which the current remains low (ΔtOFF) both decrease at
increasing Vread. The same behavior can be seen in the numerical simulations of
Fig. 4.4c obtained with a finite-element method (FEM) numerical model of RTN
[58]. The voltage dependence of RTN can be understood as the acceleration of RTN
fluctuation kinetics due to the voltage-induced Joule heating. Similarly, RTN can be
accelerated at high ambient temperature [58]. Figure 4.5 shows the architecture of
a TRNG circuit exploiting RTN in RRAM [29]. The TRNG is based on a contact-resistive
random-access memory (CRRAM), integrated on the drain contact of a
Fig. 4.6 a Measured read current as a function of time for a device in LRS with R = 10 kΩ
and b corresponding relative standard deviation σI/I. c and d Calculated current versus time and
corresponding relative standard deviation. e PSD of experimental and calculated noise, showing a
1/f behavior. Results from the analytical model of [59] are reported in (b), (c) and (e). Reprinted
with permission from [59]. Copyright (2015) IEEE

MOS transistor with a 1T1R structure. The 1T1R structure is biased with a voltage
VR, thus any RRAM fluctuation due to RTN results in a fluctuation of the voltage VD
at the transistor drain. The drain potential VD is compared to a reference voltage Vref
by an integrated comparator (Fig. 4.5a), leading to a binary random digital output
as shown in Fig. 4.5b. Sampling the digital output at successive times with a clock
frequency fCK leads to a sequence of random bits provided fCK ≪ fRTN, where
fRTN is the average rate of RTN fluctuations.
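The requirement that the sampling clock be slow relative to the RTN rate can be checked with a simple model. The sketch below assumes a symmetric two-state telegraph process with the same transition rate in both directions, for which the probability that the state differs after a sampling interval T is 0.5·(1 − exp(−2·fRTN·T)). The symmetry, the current levels and the function names are illustrative assumptions; the comparator is modeled as a threshold on the current rather than on the drain voltage.

```python
import math
import random


def sample_rtn_bits(n_bits, f_rtn, f_ck, i0=1.0e-6, i1=1.2e-6, seed=None):
    """Sample a symmetric two-level RTN current at clock f_ck and threshold it.

    f_rtn is the assumed transition rate (per second) in each direction.
    """
    rng = random.Random(seed)
    i_ref = 0.5 * (i0 + i1)  # comparator reference placed midway between the levels
    # Probability that the RTN state has changed between two consecutive samples.
    p_flip = 0.5 * (1.0 - math.exp(-2.0 * f_rtn / f_ck))
    state = rng.randint(0, 1)
    bits = []
    for _ in range(n_bits):
        if rng.random() < p_flip:
            state ^= 1
        current = i1 if state else i0
        bits.append(1 if current > i_ref else 0)
    return bits


def repeat_fraction(bits):
    """Fraction of consecutive bit pairs that are equal (0.5 for uncorrelated fair bits)."""
    same = sum(a == b for a, b in zip(bits, bits[1:]))
    return same / (len(bits) - 1)


if __name__ == "__main__":
    slow = sample_rtn_bits(20000, f_rtn=1e3, f_ck=10.0, seed=1)   # f_ck << f_rtn
    fast = sample_rtn_bits(20000, f_rtn=1e3, f_ck=1e5, seed=1)    # f_ck >> f_rtn
    print("repeat fraction, slow clock:", repeat_fraction(slow))
    print("repeat fraction, fast clock:", repeat_fraction(fast))
```

When the clock is much faster than the RTN rate, consecutive samples mostly see the same level and the output consists of long runs of identical bits; only a slow clock yields nearly independent samples, at the cost of throughput.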
Although the scheme is very simple, the TRNG of Fig. 4.5 has a few issues related
to both the physical concept and the circuit. The circuit has been reported to have
a relatively large area, namely 2400 F² in 65 nm technology, i.e., about 10 µm² [29].
Practical TRNGs based on RTN phenomena are also affected by the difficult control
of amplitude, rate, and uniformity of the physical RTN. In fact, an unbiased RNG
with equal 50% probability of generating either a “0” or a “1” is obtained only if
the I0 and I1 sub-distributions in Fig. 4.3b have the same area. Also, as previously
described, RTN is affected by temperature and voltage, leading to instabilities of
the RTN entropy source. The amplitude of the RTN should be large enough to be
distinguishable by the comparator stage, while the reference voltage Vref needs to be
adjusted carefully depending on the specific level of the resistance and its fluctuation.
The nonuniformity of the “0”/“1” balance in the output bitstream can be compensated
by digital post-processing such as the von Neumann algorithm; however, this comes
at the expense of additional circuit area and power overhead.
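The von Neumann algorithm mentioned above is simple enough to sketch in a few lines. The version below processes non-overlapping bit pairs, discards the pairs 00 and 11, and outputs the first bit of each 01 or 10 pair; the biased test source and the function names are illustrative assumptions. The correction removes bias only if the input bits are independent.

```python
import random


def von_neumann_debias(bits):
    """Von Neumann corrector: map pair 01 -> 0 and 10 -> 1, drop pairs 00 and 11."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out


def biased_bits(n_bits, p_one, seed=None):
    """Independent bits with P(1) = p_one, standing in for an unbalanced RTN source."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n_bits)]


if __name__ == "__main__":
    raw = biased_bits(200000, p_one=0.8, seed=42)
    clean = von_neumann_debias(raw)
    print("raw fraction of ones:  ", sum(raw) / len(raw))
    print("clean fraction of ones:", sum(clean) / len(clean))
    print("throughput retained:   ", len(clean) / len(raw))
```

For independent bits with P(1) = p, a pair is kept with probability 2p(1 − p), so the output rate is at most one quarter of the input rate and drops further as the bias grows. This throughput loss comes on top of the area and power overhead noted in the text.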
Most recently, to compensate for the area occupation and other issues of the TRNG
circuit of Fig. 4.5, the current difference in the 1/f^β noise of the RRAM device was
used as the entropy source [32]. The RRAM noise is associated with multi-trap capture
and emission events in defects (e.g., oxygen vacancies) along the conductive filament
(CF) in LRS and the localized conductive path in HRS [59]. Figure 4.6a shows the
read current Iread measured for a RRAM in the LRS with an average R = 10 kΩ,
biased with a read voltage Vread = 10 mV. Current fluctuations due to the 1/f noise
result in an increasing relative standard deviation σI/I, where I is the average value of
Fig. 4.7 a Conceptual representation of the entropy harvesting algorithm for TRNG. b Block
diagram of the parallel TRNG circuit, which allows for a 32 Mbps random bitstream. c Minimum
entropies are higher than 0.999 over a broad range of operating temperatures and voltages. d P-value
of 1000 sequences of 1 Mb bitstreams for the frequency test. Reprinted with permission from [32].
Copyright (2016) IEEE

Iread at any time t, while σI is its standard deviation (Fig. 4.6b). The increasing value
of σI/I with time is due to the increasing noise contributions at low frequency,
which is typical of the 1/f behavior of noise. The simulation results by a numerical Monte
Carlo model of 1/f noise in Fig. 4.6c and d show similar behavior [59]. Figure 4.6e
shows the measured and calculated power spectral density (PSD) (SI), evidencing a
clear −1 slope, typical of the 1/f noise.
The 1/f noise can be harvested for TRNG by the
circuit shown in Fig. 4.7 [32]. Here, the noisy current is sampled at subsequent times
t and t + Δt, then the two sampled currents are subtracted and the difference
ΔI is compared to 0. Finally, the random bit value is assigned to 0 or 1 depending
on ΔI being positive or negative, respectively. With respect to the RTN scheme of
TRNG, the differential scheme allows both for a reduced area of 0.256 µm² (or 160
F² in 40 nm technology) and a reduced bias in the probability of extracting a “0” or a
“1” bit. In fact, the differential current ΔI (Fig. 4.7a) follows a Gaussian distribution
that is symmetric about zero, so that “0” and “1” are equally probable. The circuit
design (Fig. 4.7b) allows for a precise current value extraction using a timing sense
amplifier (TSA) and a resistance-to-time converter (RTC) [62], while the parallel
configuration of multiple devices enables up to 32 Mbps operation, with a 0.04 nJ/bit
energy efficiency. Test results are finally reported by showing a minimum entropy
higher than 0.999 over a broad range of temperature (−40 °C < T < 120 °C) and with
different voltages (VDD = 0.1 V) (Fig. 4.7c). The high performance of the scheme is
further demonstrated by the P-value, i.e., a FOM for the randomness of the bit
stream, of 1000 groups of 1 Mb bitstreams for the frequency test of the NIST 800-22 suite [63]
(Fig. 4.7d).
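The differential extraction and the frequency test can be sketched together. In the code below, the noisy current is replaced by a Gaussian random walk, which is a stand-in chosen for simplicity and not a true 1/f process; the use of non-overlapping sample pairs, the step size and all function names are assumptions. The P-value follows the frequency (monobit) test of NIST 800-22, P = erfc(|S|/√(2n)), where S is the sum of the bits mapped to ±1.

```python
import math
import random


def random_walk_current(n_samples, i_mean=1.0e-6, step=1.0e-9, seed=None):
    """Stand-in noisy read current: a Gaussian random walk around i_mean (not true 1/f)."""
    rng = random.Random(seed)
    current = i_mean
    trace = []
    for _ in range(n_samples):
        current += rng.gauss(0.0, step)
        trace.append(current)
    return trace


def differential_bits(currents):
    """Compare I(t + dt) with I(t) on non-overlapping pairs: dI > 0 -> 0, dI < 0 -> 1."""
    bits = []
    for i_a, i_b in zip(currents[0::2], currents[1::2]):
        delta = i_b - i_a
        if delta > 0:
            bits.append(0)
        elif delta < 0:
            bits.append(1)
        # delta == 0 carries no information and is skipped
    return bits


def monobit_p_value(bits):
    """P-value of the NIST 800-22 frequency (monobit) test."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2.0 * n))


if __name__ == "__main__":
    bits = differential_bits(random_walk_current(20000, seed=7))
    print("bits generated:", len(bits))
    print("fraction of ones:", sum(bits) / len(bits))
    print("monobit P-value:", monobit_p_value(bits))
```

In the NIST suite, a sequence is usually considered to pass a test when its P-value is at least 0.01, so a truly random source is still expected to fail about 1% of the time; this is why results such as Fig. 4.7d are reported over many sequences rather than one.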

4.5 TRNG Based on Stochastic Time

A key limitation of noise as a source of entropy is the unpredictable amplitude and
frequency dependence. To better control the generation of random bits, TRNG can
rely on the stochastic properties of switching, namely time and voltage.
Figure 4.3c shows the basic concept for exploiting the variation of the stochastic
delay time (Δt) for the set transition. Assuming that a voltage V slightly larger or
comparable to Vset is applied to a RRAM device in the HRS, set transition occurs
after a certain delay time Δt, where Δt decreases as the applied voltage is increased
[56]. Most importantly, Δt is subject to relatively large variation from cycle to cycle,
due to the ion migration being dependent on the local microstructure and atomistic
migration of ions [59]. The resulting probability distribution of Δt is exponentially
decreasing as shown in Fig. 4.3d. The exponential distribution can be understood
by the set transition being described by a thermally activated process overcoming a
given energy barrier EA [64]. As a result, the set events obey Poisson statistics and
the delay time Δt follows the exponential
distribution P(Δt) = (1/τ) exp(−Δt/τ), where τ is the characteristic time constant
given by τ = τ0 exp(EA/kT), where τ0 is a constant, k is the Boltzmann constant and
T is the local temperature [65].
For every set pulse in Fig. 4.3d, a random bit equal to 0 or 1 can be assigned based
on the set transition taking place in an even or odd time window controlled by a given
constant frequency fCLK. By repeating the set transition several times, a random
bitstream can be generated. By using this scheme, an improved randomness quality
of the generated bitstream can be demonstrated, provided that Δt is sufficiently large
compared to the selected time window TCLK and sufficiently small compared to the
overall width tP of the applied pulse (TCLK < Δt < tP) [66]. Therefore, the inherent
randomness in the stochastic switching time has received a great deal of interest as the
fundamental entropy source for stochastic computing [67], neuromorphic circuits
[68] and TRNG [10, 66]. Note that the sensitivity for switching is set by the window
between HRS and LRS, thus providing a more robust TRNG scheme compared to
the poorly predictable resistance change of RTN or 1/f noise.
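The even/odd window assignment can be examined quantitatively. For an exponentially distributed delay with time constant τ, the probability that the set event falls in window k of width TCLK is q^k (1 − q) with q = exp(−TCLK/τ), so the probability of an even window is 1/(1 + q). The sketch below evaluates this expression and checks it by Monte Carlo sampling; it assumes a pulse much longer than τ (the upper bound tP is ignored), and the function names and numerical values are illustrative.

```python
import math
import random


def p_even_window(t_clk, tau):
    """Probability that an exponential delay falls in an even-numbered clock window."""
    return 1.0 / (1.0 + math.exp(-t_clk / tau))


def delay_time_bits(n_bits, tau, t_clk, seed=None):
    """Draw exponential set delays and assign bit 0 (even window) or 1 (odd window)."""
    rng = random.Random(seed)
    bits = []
    for _ in range(n_bits):
        delay = rng.expovariate(1.0 / tau)  # mean delay equals tau
        window = int(delay // t_clk)
        bits.append(window % 2)
    return bits


if __name__ == "__main__":
    tau = 1.0e-3  # assumed characteristic delay in seconds
    for t_clk in (1.0e-3, 1.0e-4, 1.0e-5):
        bits = delay_time_bits(50000, tau, t_clk, seed=3)
        measured = bits.count(0) / len(bits)
        print(f"TCLK/tau = {t_clk / tau:7.3f}  P(even) analytic = "
              f"{p_even_window(t_clk, tau):.4f}  measured = {measured:.4f}")
```

The bias toward even windows vanishes only when TCLK is much shorter than τ, which is one way of reading the condition TCLK < Δt stated above.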
Figure 4.8 shows the measured distributions of Δt for increasing voltages
V = 2.6 V (a), 3.2 V (b) and 3.6 V (c) [69]. The constant voltage was applied
to the device in the HRS state, while the delay time Δt was measured at the onset of
the set transition. The device was then reset with a negative voltage pulse, to allow
for a repeated set transition [69]. The results confirm the exponential
distribution of the delay time Δt in Fig. 4.3d. Most importantly, the average Δt = τ
Fig. 4.8 Distributions of switching time delay for applied voltage of 2.6 V (a), 3.2 V (b), and 3.6 V
(c), with their corresponding fitting with the exponential distribution. The only fitting parameter was
τ = 15.3, 1.2 and 0.029 ms for (a), (b) and (c), respectively. (d) shows the voltage dependence
of τ. Reprinted with permission from [69]. Copyright (2008) American Chemical Society

in Fig. 4.8d decreases exponentially with the applied voltage, thus reflecting the
decrease of the effective energy barrier EA with the applied voltage [56, 64]. Data
in Fig. 4.8d highlight that, although the single switching event is stochastic, the
overall distribution of switching times can be predicted and controlled by the applied
voltage [68, 69].
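The voltage dependence of τ can be estimated from the three fitted values quoted in the caption of Fig. 4.8 (τ = 15.3, 1.2 and 0.029 ms at 2.6, 3.2 and 3.6 V). The sketch below performs a least-squares fit of ln τ against V, assuming the simple form τ = τ∞ exp(−V/V0); this functional form, the symbol V0 and the function name are assumptions made for illustration, and three points allow only a rough estimate.

```python
import math


def fit_log_tau_vs_voltage(voltages, taus):
    """Least-squares line through (V, ln tau); returns (slope per volt, intercept)."""
    n = len(voltages)
    ys = [math.log(t) for t in taus]
    v_mean = sum(voltages) / n
    y_mean = sum(ys) / n
    sxx = sum((v - v_mean) ** 2 for v in voltages)
    sxy = sum((v - v_mean) * (y - y_mean) for v, y in zip(voltages, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * v_mean
    return slope, intercept


# Values quoted in the caption of Fig. 4.8 (tau converted to seconds).
VOLTAGES = [2.6, 3.2, 3.6]
TAUS = [15.3e-3, 1.2e-3, 0.029e-3]

if __name__ == "__main__":
    slope, intercept = fit_log_tau_vs_voltage(VOLTAGES, TAUS)
    print("d(ln tau)/dV [1/V]:", slope)
    print("decades of tau per volt:", slope / math.log(10.0))
    print("V0 = -1/slope [V]:", -1.0 / slope)
    for v, tau in zip(VOLTAGES, TAUS):
        print(f"V = {v} V  measured tau = {tau:.3e} s  "
              f"fitted tau = {math.exp(intercept + slope * v):.3e} s")
```

A steep slope means that a small change of the applied voltage shifts the mean delay by a large factor, which is convenient for tuning but also explains why the pulse amplitude must be controlled tightly.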
The stochastic delay time was adopted as the entropy source for TRNG by the
circuit shown in Fig. 4.9a [66]. The proposed TRNG consists of a volatile RRAM
device with Ag TE and Ag-doped SiO2 dielectric layer. In this type of device, the
Ag migration from the TE results in the formation of an unstable CF, which decays
soon after the set transition with a retention time ranging from a few µs to a few ms
[70–73]. The volatile behavior is due to the large diffusivity of Ag combined with
Fig. 4.9 a Schematic representation of the TRNG circuit block diagram, comprising a memristive
device, a comparator, an AND gate, and a counter. b Pulsed waveforms at each stage of the circuit,
explaining the working principle of the TRNG. Reprinted from [66]. Creative Commons (2017)

the mechanical compressive stress in the dielectric layer [74] and the tendency to
minimize the surface to volume ratio of the CF [71]. In the TRNG circuit of Fig. 4.9a,
the volatile RRAM device is connected with a series resistance in a voltage-divider
configuration. The potential V2 of the intermediate node of the voltage-divider serves
as the input of a voltage comparator. The comparator output and a clock pulse serve
as the input of an AND gate, and a counter reads the AND output. A TRNG cycle is
shown in Fig. 4.9b: the application of a voltage pulse V1 (1) causes a set transition
in the RRAM device after a stochastic Δt, which causes V2 to suddenly increase
above the reference Vref (2), thus making the comparator output go to a high logic
level V3 (3). Due to the stochastic Δt, the V3 pulse has a random duration, which is
measured by the counter in units of the clock period TCLK. Note that the binary bit
(6) flips between 0 and 1 for the whole duration of the V3 pulse, eventually resulting
in a random bit. A nonvolatile RRAM could be adopted in this scheme as
well; however, a reset pulse would be needed to reinitialize the device for a new cycle.
The use of volatile RRAM in this case makes the TRNG algorithm easier and more
energy efficient, as no reset pulse is needed.
To match the time window TCLK < Δt < tP, the pulse voltage V1 should be
carefully tuned, which usually requires complicated probability tracking techniques
[75]. Also, extracting entropy from the stochastic switching time can be difficult due
to its sensitivity to device parameters and process variations, requiring a probability
tracking of the applied voltage for every TRNG on the same chip, or in separate chips
[76].

4.6 TRNG Based on Stochastic Voltage

A promising and more robust TRNG relies on the exploitation of the stochastic
switching voltage. Namely, instead of measuring the delay time Δt for switching,
one can monitor the device for a given amount of time, where the switching probabil-
ity becomes the stochastic entropy source. This approach is schematically depicted
in Fig. 4.3e, where various current–voltage characteristics measured on the same
RRAM device demonstrate a distribution of set voltage (Vset), due to the cycle-to-cycle
variation. The application of a voltage equal to the average transition voltage
<Vset> to the device in the HRS will then lead to a set transition with 50% probability.
As a result, the measured resistance of the device after the applied voltage
pulse shows a bimodal distribution as indicated in Fig. 4.3f, where the two
sub-distributions correspond to LRS and HRS. The random bitstream can thus be
generated by associating the LRS and HRS to bit values “0” and “1”, respectively
[33]. A similar scheme can be extended to stochastic computing, where an ana-
log value can be obtained as the sequence of stochastic bimodal resistance values
obtained from the same device [68]. To illustrate the voltage-based TRNG con-
cept, Fig. 4.10a shows the measured I–V curves for the same RRAM device with
1T1R configuration for six successive set/reset cycles [33]. The switching parame-
ters, such as set and reset voltages, and the HRS and LRS resistance values show
a large variability from cycle to cycle, which can be explained by considering the
physics of the random formation and disruption of the conductive filament [57].
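The stochastic set concept described above can be sketched with a simple Monte Carlo model. The code below draws a new Vset in every cycle from a Gaussian distribution and declares a set transition whenever Vset falls below the applied voltage; the Gaussian shape, the numerical values of the mean and spread, and the function names are assumptions, and partially completed transitions are ignored. Following the text, LRS is mapped to bit 0 and HRS to bit 1. The second function counts the four possible combinations of consecutive bits, which should each occur in about a quarter of the cases if cycles are independent.

```python
import random


def stochastic_set_bits(n_cycles, v_a, v_set_mean, v_set_sigma, seed=None):
    """One bit per reset/set cycle: 0 if the device sets (LRS), 1 if it stays in HRS."""
    rng = random.Random(seed)
    bits = []
    for _ in range(n_cycles):
        v_set = rng.gauss(v_set_mean, v_set_sigma)  # cycle-to-cycle variation (assumed Gaussian)
        switched = v_set < v_a
        bits.append(0 if switched else 1)
    return bits


def pair_populations(bits):
    """Fractions of consecutive-cycle pairs (b_i, b_i+1) in each of the four regions."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(bits, bits[1:]):
        counts[(a, b)] += 1
    total = len(bits) - 1
    return {pair: count / total for pair, count in counts.items()}


if __name__ == "__main__":
    # Illustrative values: applied voltage equal to the assumed mean set voltage.
    bits = stochastic_set_bits(20000, v_a=1.6, v_set_mean=1.6, v_set_sigma=0.1, seed=5)
    print("fraction of ones (HRS):", sum(bits) / len(bits))
    for pair, fraction in sorted(pair_populations(bits).items()):
        print(pair, round(fraction, 3))
```

Shifting the applied voltage away from the mean set voltage by even a fraction of the spread biases the output, which illustrates why the choice of the applied voltage is critical in this scheme.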
Figure 4.10b shows the pulse sequence for characterizing the random set transition
process, including: (1) a positive set pulse to deterministically initialize the device
in LRS, (2) a negative reset pulse with a stop voltage Vstop to induce transition to
the HRS, (3) a positive set pulse with voltage VA close to <Vset> to stochastically
induce a set transition event, and (4) a read pulse to measure the resistance in the
final state. Figure 4.10c shows the resulting resistance distribution for a random set
experiment with VA = 1.6 V. Data show a bimodal distribution, corresponding to
an LRS sub-distribution with R ≈ 12 kΩ and an HRS sub-distribution above 100 kΩ.
The origin of the bimodal distribution is clarified in Fig. 4.10d, which shows three
characteristic I–V curves for various stochastic events, corresponding to states A,
B, and C in Fig. 4.10c. Case A corresponds to a cycle where Vset was higher than
the applied VA, due to a relatively high HRS resistance after the reset pulse. As a result, no
set process took place in this case, thus the resistance was found in the HRS sub-distribution
(Fig. 4.10c). Case C corresponds to Vset being smaller than VA, thus
Fig. 4.10 a Measured I–V characteristics for six cycles on the same 1T1R structure, evidencing
stochastic switching. b Sequence of applied pulses for TRNG, with (c) the cumulative distribution
of read resistance. Random set process is highlighted in the three I–V curves (d), corresponding to
states A, B and C in (c). Reprinted with permission from [33]. Copyright (2015) IEEE

resulting in a set transition with a compliance current IC = 50 µA controlled by the
MOS transistor connected in series with the device. After the set event, the resistance
falls within the LRS sub-distribution in Fig. 4.10c. Finally, the intermediate case B
corresponds to an applied VA close to Vset. In this case, the device starts the set
transition but cannot complete it within the pulse time. This results
in an intermediate resistance between LRS and HRS, constituting the flat region of
the bimodal distribution between HRS and LRS sub-distributions in Fig. 4.10c. It
has been shown that this flat region, i.e., the occurrence of intermediate cases of
type B, can be minimized by using a reduced pulse width or a proper shape (e.g., a
saw-tooth shape with abrupt drop after reaching VA) [33]. A key requirement for
the TRNG in Fig. 4.10 is the absence of memory effects in the entropy harvesting
process. To support this point, Fig. 4.11a shows the measured resistances during
500 successive random set cycles, clearly evidencing a bimodal distribution [33].
The absence of memory effect is further demonstrated in Fig. 4.11b, showing the
correlation plot of R in cycle i+1 as a function of the R in cycle i for all the cycles
shown in Fig. 4.11a. We can identify four different regions, corresponding to the
cell being in the same state (LRS or HRS) or different states in the two consecutive
cycles. Figure 4.11c shows the histogram representation of the probability for the
four regions, showing comparable values around 25%. This demonstrates the lack

Fig. 4.11 Measured resistance for 500 random set cycles with Vstop = −1.45 V and V_A = −1.6 V
(a), correlation of R in cycle i+1 as a function of R in cycle i (b) and (c) population of the four
regions in (b). Reprinted with permission from [33]. Copyright (2015) IEEE

of correlation across two consecutive cycles, which is consistent with true randomness
of the bit stream. To guarantee proper RNG operation, a positive-feedback
regeneration of the analog output values might be required. Figure 4.12a shows a
compact regeneration circuit [33], comprising an RRAM device in 1T1R structure as
the first stage and a CMOS inverter as the second stage. This scheme takes advantage
of the relatively large resistance window between LRS and HRS, thus allowing
the use of a small CMOS inverter instead of the larger analog comparator that is
typically required for recovering the small signal in RTN-based RNG [29].
Figure 4.12b shows the Vin–Vout characteristic of the CMOS inverter, evidencing
the high gain in the transition region (with a threshold voltage V_T = 0.4 V), which
allows for digital restoration. The impact of this regeneration circuit on the random
bit distribution is illustrated in Fig. 4.13, showing measured and simulated bimodal
resistance distributions (a), the simulated digital bimodal distribution of the inverter
output Vout (b) and the sequence of the output voltage Vout for 2 × 10^5 cycles (c).
To achieve sufficient uniformity of the generated random bits, the applied voltage
should be finely tuned to match the exact value <Vset>. This requires a preliminary

Fig. 4.12 Regeneration circuit (a), comprising the 1T1R RRAM device and a CMOS inverter.
b Vin–Vout characteristic of the inverter. Reprinted with permission from [33].
Copyright (2015) IEEE

probability tracking procedure [76], which results in a certain overhead in terms of
complexity, area and power consumption.
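
The need for tuning and the memory-effect check described in this section can be illustrated with a minimal numerical sketch. It assumes, purely for illustration, that the set voltage is Gaussian distributed from cycle to cycle; the function names, the voltage magnitudes and the spread are hypothetical and are not taken from [33]. A pseudo-random generator stands in for the physical entropy source, so the sketch only shows how the bias depends on the distance between V_A and <Vset>, and how the four-region population of Fig. 4.11c can be counted.

```python
import random

def random_set_bits(v_applied, v_set_mean, v_set_sigma, n_cycles, seed=0):
    """Emulate random-set cycles: bit = 1 if the cell sets (Vset < V_A), else 0.

    Voltages are magnitudes in volts. The Gaussian Vset model is an assumption.
    """
    rng = random.Random(seed)  # pseudo-random stand-in for the physical entropy
    return [1 if rng.gauss(v_set_mean, v_set_sigma) < v_applied else 0
            for _ in range(n_cycles)]

def ones_fraction(bits):
    """Fraction of 1s in the bit stream (ideal value: 0.5)."""
    return sum(bits) / len(bits)

def pair_fractions(bits):
    """Population of the four (cycle i, cycle i+1) regions, as in a correlation plot."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for pair in zip(bits, bits[1:]):
        counts[pair] += 1
    total = len(bits) - 1
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical values: V_A tuned exactly to the mean Vset, and detuned by one sigma.
tuned_bits = random_set_bits(1.6, 1.6, 0.1, 20000)
detuned_bits = random_set_bits(1.7, 1.6, 0.1, 20000)

print("tuned   fraction of 1s:", ones_fraction(tuned_bits))
print("detuned fraction of 1s:", ones_fraction(detuned_bits))
print("tuned pair populations:", pair_fractions(tuned_bits))
```

In this toy model a detuning of one standard deviation is already enough to bias the output strongly toward 1, which is why a tracking procedure is needed for the voltage-based scheme.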

4.7 Differential TRNG Schemes

To overcome the need for probability tracking in voltage-based TRNG, various differential
schemes have been recently developed [34, 75]. In these TRNGs, either the
competition between two RRAM devices [34] or the comparison between consecutive
cycles on the same device [75] yields high-quality entropy without probability
tracking, thus with a relatively simple circuit layout. A typical differential scheme
relies on the coupling of two RRAM devices in either series or parallel configuration,
with the entropy source being the variability of set or reset transitions [34]. Three
different schemes were proposed, namely: (a) parallel reset, (b) series reset and (c)
parallel set, as detailed in the following [34]. Figure 4.14a shows the parallel-reset
TRNG circuit, comprising two RRAM cells, referred to as P and Q, connected in
parallel. The common BE is connected to a comparator for the differential read.
Figure 4.14b shows the waveforms applied to the TE of devices P and Q, i.e., V_P
and V_Q, respectively, and the voltage Vout of the common BE node between P and
Q. During a TRNG cycle, the applied waveforms include three phases, namely, (1)
a positive voltage is applied across both P and Q in parallel, inducing the set transition
in both devices, (2) a negative voltage is applied across P and Q in parallel, inducing
the reset transition in both devices, (3) a differential read phase where +Vread and
−Vread are applied at P and Q with floating BE to test the voltage divider between P
and Q.

Fig. 4.13 Measured and calculated distribution of the RRAM resistance (a),
simulated distribution of the inverter output voltage Vout (b) and measured Vout
for 2 × 10^5 cycles (c). Reprinted with permission from [33]. Copyright (2015) IEEE

Fig. 4.14 a Parallel-reset differential scheme for TRNG and b sequence of applied signals. Both
P and Q start in HRS and are independently set, then reset and finally read using a voltage-divider
configuration. The analog comparator (CMP) digitally restores the output signal. Reprinted with
permission from [34]. Copyright (2016) IEEE

Depending on the resistance values of P and Q, namely R_P and R_Q, respectively,
the output voltage is found to be positive or negative, thus dictating the value
of the output random bit. Given the relatively large variability of the HRS resistance

[33, 57], Vout varies stochastically from cycle to cycle, thus constituting the basis
for random bit generation. In this first approach, HRS resistance variation acts as
the entropy source. Note that the bit value probability is automatically set to 50%
by the uniform cycle-to-cycle distributions of HRS resistance of P and Q, as the
cycle-to-cycle variation in RRAM is comparable to the cell-to-cell variation [77].
Figure 4.15a shows the cumulative distributions of measured and calculated R_P
and R_Q, both after set and after reset. The read Vout distributions are shown in
Fig. 4.15b for experimental and calculated data, indicating a bimodal shape with
50% transition probability. By reading the voltage Vout with an analog comparator
(Fig. 4.14a), the bimodal distribution can be improved, as shown by the distribution

Fig. 4.15 Cumulative distributions of resistance after set and after reset for cells P and Q (a).
b Distributions of the output voltage Vout and Vout2, before and after the CMP, respectively. c
Measured Vout and Vout2 for 1000 RNG cycles with the corresponding PDFs (d). Reprinted with
permission from [34]. Copyright (2016) IEEE

Fig. 4.16 a Series reset differential scheme for TRNG and b sequence of applied signals. From the
HRS, the cells are independently set, then they undergo a random reset, during which only one can
switch, and finally they are read in voltage-divider configuration. Reprinted with permission from
[34]. Copyright (2016) IEEE

of the comparator output Vout2 in Fig. 4.15b. The bulky comparator may be replaced
by a CMOS inverter, thus reducing the on-chip area occupation [33]. To demonstrate
the cycle-by-cycle operation of the parallel-reset scheme, Fig. 4.15c shows Vout
and Vout2 for 1000 consecutive cycles, while Fig. 4.15d shows their corresponding
probability density functions. The TRNG does not require any probability tracking
thanks to the cycle-to-cycle variability being comparable to the cell-to-cell variability [77].
Figure 4.16a shows an alternative differential TRNG scheme, namely the series-reset
configuration. This comprises two RRAM devices connected in series with V_P and
V_Q as supply voltages and the intermediate node of potential Vout connected to an
output comparator. Figure 4.16b shows the applied waveforms of V_P, V_Q and Vout
during a TRNG cycle, consisting of (1) independent set of P and Q, (2) random reset
of either P or Q, (3) differential read of Vout. For simplicity, we assumed V_Q = −V_P
in the figure. During the random reset event, a negative voltage V_P − V_Q < 0 is
applied to the two devices in series, while the common node is left floating. A total
applied voltage |V_P − V_Q| > 2 Vreset drops across the devices, thus inducing the reset
transition in one of the two devices. In fact, once the transition begins in one of the
two cells, the voltage across it increases because of the voltage-divider effect, while
the voltage drop across the other device decreases, thus preventing both devices
from undergoing the reset transition. This configuration thus realizes a positive feedback,
resulting in a self-accelerated reset event that takes place randomly in one device
only. Specifically, the reset transition takes place in the device with the smallest
Vreset. Because of the cycle-to-cycle variability of Vreset, the probability for one
device to reset is ideally 50% [57]. Figure 4.17a shows the cumulative distributions
of R_P and R_Q after the set and reset pulses in Fig. 4.16b [34]. After the random reset
pulse, both P and Q show the same bimodal distribution with transition point at
50% probability, thus demonstrating unbiased TRNG with no need for probability
tracking. To gain further insight into the random reset process, Fig. 4.17b shows the
correlation plot of R_Q as a function of R_P after either set or reset. R_P and R_Q
appear to be anti-correlated after the reset phase, namely R_P is high for low R_Q
and vice versa, which reveals a conditional reset of one RRAM device only.

Figure 4.17c shows the distributions of experimental and calculated Vout , indicating
a bipolar mode with transition point at 50% probability. Similar to other TRNG
schemes, a digital regeneration can be obtained by a comparator or a CMOS inverter.
Figure 4.17d shows the cycle-to-cycle values of Vout and Vout2 during the application
of the RNG pulse scheme of Fig. 4.16b. Note that after each differential read phase,
a final deterministic reset pulse was applied to ensure equal HRS conditions in P
and Q before the application of the set pulse. Figure 4.17e shows the corresponding
distributions of Vout and Vout2 for both data and calculations [34]. Figure 4.18a shows
the parallel set scheme [34], where the two RRAM devices in parallel configuration
are connected to a common select transistor, with the drain terminal connected to
the input node of a comparator. Figure 4.18b shows the applied waveform cycle,
including (1) an independent reset of P and Q, (2) a random set pulse of P and Q, and
(3) a differential read by the application of a voltage 2 Vread across the two devices,
while the transistor is biased in the off state. This TRNG scheme is based on the
one-transistor/two-resistor (1T2R) structure in Fig. 4.18a, where the application of a
positive voltage across the devices causes set transition to take place randomly in one
of the two devices first. As a result of the transition to LRS and the voltage-divider
effect with the transistor, the voltage across both devices drops, which prevents any
set transition from taking place in the second RRAM device. In this TRNG scheme, the
cycle-to-cycle variability of Vset plays the role of entropy source. Figure 4.19a shows
the read resistance distributions for P and Q, evidencing the expected bimodal shape
with HRS/LRS transition at 50%. In order to verify that the random set happens
stochastically in either one of the devices, Fig. 4.19b shows the correlation plot of
R_Q as a function of R_P, again indicating an anti-correlation where P is in HRS for Q
in LRS, and vice versa. Finally, Fig. 4.19c shows the cycle-to-cycle output values of
Vout and Vout2 , while Fig. 4.19d shows their corresponding probability distributions.
Comparing these solutions for entropy harvesting, different performances are
apparent in terms of bimodal distribution of R and Vout . For instance, the parallel-set
TRNG (Fig. 4.19) shows improved results with respect to the parallel-reset TRNG
(Fig. 4.15). This can be understood considering the abrupt set transition in the parallel
set process as opposed to the more gradual reset event in the parallel-reset process.
The abrupt set transition is explained by the physical positive feedback where the
first initiation of the filament causes an increase of the local Joule heating, thus
accelerating the further growth of the filament [57]. This highlights the key role that
the physics of the entropy-generating process plays in controlling the quality of the
TRNG circuit.
A general drawback of the differential pair approach is the assumption that cycle-to-cycle
variation dominates over the cell-to-cell variation. In the presence of a large
mismatch between the two cells in the differential pair, e.g., where one cell systematically
displays a lower Vset than the other cell, the TRNG might show deviations
from the uniform behavior. Although this might be acceptable for PUF applications,
where the random unique key has to be generated only once in the lifetime of the
device, it might cause unacceptable nonuniformities in TRNG [34].
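
The sensitivity of the differential pair to cell-to-cell mismatch can be made concrete with a small sketch of the parallel-reset read. It assumes lognormal HRS resistances with hypothetical medians and spread, a pseudo-random generator in place of the physical switching, and an arbitrary bit convention (1 for positive Vout); none of these values come from [34]. The divider expression follows from applying +Vread to P and −Vread to Q with a floating common node.

```python
import math
import random

def v_out(r_p, r_q, v_read=0.1):
    """Voltage of the floating common node with +v_read on P and -v_read on Q."""
    return v_read * (r_q - r_p) / (r_p + r_q)

def parallel_reset_bits(n_cycles, median_p, median_q, sigma_log=0.5, seed=1):
    """One bit per cycle from the sign of Vout (assumed convention: 1 if Vout > 0).

    HRS resistances are drawn from lognormal distributions (an assumption);
    median_p and median_q model a systematic mismatch between the two cells.
    """
    rng = random.Random(seed)
    bits = []
    for _ in range(n_cycles):
        r_p = rng.lognormvariate(math.log(median_p), sigma_log)
        r_q = rng.lognormvariate(math.log(median_q), sigma_log)
        bits.append(1 if v_out(r_p, r_q) > 0 else 0)
    return bits

matched = parallel_reset_bits(20000, 1e6, 1e6)      # identical cells
mismatched = parallel_reset_bits(20000, 1e6, 2e6)   # Q systematically more resistive

print("matched pair   fraction of 1s:", sum(matched) / len(matched))
print("mismatched pair fraction of 1s:", sum(mismatched) / len(mismatched))
```

With matched cells the output is balanced without any tuning, whereas a factor-of-two mismatch in the median resistance visibly skews the bit stream in this toy model.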

Fig. 4.17 a Cumulative distributions of R after set and after reset for both cells P and Q. b Correlation
plot of R_Q as a function of R_P. c Cumulative distributions of Vout and Vout2. d Measured
Vout and Vout2 during RNG cycling and e corresponding PDF. Reprinted with permission from
[34]. Copyright (2016) IEEE

Fig. 4.18 a Parallel set differential scheme and b sequence of applied signals. From the LRS, the
cells are first independently reset, then subjected to parallel set, and finally read with voltage-divider
configuration. Reprinted with permission from [34]. Copyright (2016) IEEE

Fig. 4.19 a Cumulative distributions of R after set and after reset for P and Q. b Correlation plot
of R_Q as a function of R_P after set and reset. c Cumulative distributions of Vout and Vout2 during
RNG cycling, and e corresponding measured Vout and Vout2 PDF. Reprinted with permission from
[34]. Copyright (2016) IEEE

4.8 STT Magnetic Memory for TRNG

The presented TRNG schemes can be adopted for all stochastic memory devices, e.g.,
the phase change memory (PCM) or the STT-MRAM. In particular, STT-MRAM
offers improved cycling endurance [78] and fast switching [79], which might benefit
the TRNG operation by providing an extended lifetime and a higher throughput. Figure 4.20a
shows a typical state-of-the-art STT-MRAM device, consisting of a magnetic tunnel
junction (MTJ) with perpendicular magnetic anisotropy (PMA) [78]. The MTJ con-
sists of a pinned layer (PL) and a free layer (FL), acting as bottom electrode (BE) and
top electrode (TE), respectively, and both made of ferromagnetic CoFeB. Between
the two electrodes, a dielectric layer made of crystalline MgO serves as the tunnel-
ing barrier to induce the MTJ effect [80]. As schematically shown in Fig. 4.20b, this
memory device has two stable states, where the magnetic polarization of the FL can
be either parallel (P) or antiparallel (AP) to the magnetization of the PL, resulting
in low or high resistance of the MTJ, respectively [78, 80]. Figure 4.20c shows the

Fig. 4.20 a Typical STT-MRAM device, consisting of a magnetic tunnel junction (MTJ) stack. b
Energy as a function of the FL magnetic polarization direction with respect to the PL, showing P
and AP states. c Measured and calculated I–V and d R–V pulsed characteristics with 1 µs pulse
width. Reprinted with permission from [75]. Copyright (2018) IEEE

measured current–voltage (I–V) characteristics, while the corresponding resistance–voltage
(R–V) characteristics are shown in Fig. 4.20d. The set transition from AP to P
and the reset transition from P to AP take place at the positive voltage Vset and at the
negative voltage Vreset, respectively.
As for the RRAM device, set and reset transitions in STT-MRAM are affected
by stochastic switching, thus introducing a randomness causing a voltage-dependent
bit error rate (BER) in memory applications [79]. The inherent stochastic switching
causes cycle-to-cycle variations of both Vset and Vreset [81]. Although showing
apparently similar variability, the physical origin of the stochastic switching voltage
is quite different in STT-MRAM and RRAM. In fact, the statistical variations in
STT-MRAM switching can be explained by thermally-assisted magnetization
reversal [82], where the transitions from AP to P and vice versa are induced by a
random thermal fluctuation within the potential well of Fig. 4.20b and a stochastic
transition over the energy barrier E_A between the two states. As a result, for each
applied positive or negative voltage V_A, there is a statistical distribution of set time
tset or reset time treset, respectively.
The stochastic switching in STT-MRAM has been used for various TRNG con-
cepts, either based on the time variation [31, 83] or the voltage variation [30, 76].
In particular, in the work from Vodenicarevic et al. [83] the stochastic switching
time was exploited through MTJ stack engineering. Namely, a low-stability free layer (i.e.,
one characterized by a reduced magnetic stability) was introduced instead of
the relatively high-stability nanomagnet used in memory applications. This structure
is referred to as a superparamagnetic tunnel junction [84] and shows spontaneous
stochastic switching between the two stable states due to its low stability relative to
thermal fluctuations.
However, all these schemes necessarily rely on a careful biasing configuration,
thus requiring a probability tracking approach to ensure the TRNG uniformity. Probability
tracking can be avoided by using differential concepts; however, the differential
pair approach is affected by the cell-to-cell mismatch within the pair. To solve these

Fig. 4.21 a Measured rectangular voltage pulses and current response for 2 consecutive cycles
n−1 and n, b PDF of the integrated current Qn and c PDF of differential charge ΔQn = Qn − Qn−1 .
The pulse sequence includes positive and negative rectangular pulses for stochastic set and reset
transitions, respectively, as evidenced by the abrupt steps in the current response. The random bit
is assigned from the value of ΔQn in (c). Reprinted with permission from [75]. Copyright (2018)
IEEE

issues, a novel differential concept was presented, where consecutive switching
cycles are compared in the same device, instead of comparing two coupled devices [75].
Figure 4.21a shows the applied voltage and the device current response over two consecutive
set/reset cycles. In each cycle, a stochastic pulse with positive V+ is applied,
followed by a deterministic pulse with negative V−. Both pulses have a duration
of 1 µs, although the concept can be easily scaled to a shorter pulse width thanks to
the high speed of the switching process in STT-MRAM. The stochastic switching
is evidenced in Fig. 4.21a, where a shorter delay time tset is observed during
cycle n−1 with respect to cycle n. The TRNG relies on the comparison of the
current responses in two consecutive cycles of the same STT-MRAM device.
Figure 4.21b shows the probability distribution of the integrated current Q_n = ∫ i dt,
while Fig. 4.21c shows the corresponding difference ΔQ_n = Q_n − Q_{n−1}. Given the
highly symmetric distribution of ΔQ_n, the latter is chosen as the statistical variable
for random bit generation, where a random bit value 0 or 1 is assigned for ΔQ_n < 0
or ΔQ_n > 0, respectively [75].
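
The bit assignment from ΔQ_n can be sketched numerically as follows. This is only an illustration under stated assumptions: the set delay is drawn from an exponential distribution, the current is modeled as a single step from a hypothetical AP level to a hypothetical P level, and a pseudo-random generator replaces the device. The sketch also compares non-overlapping pairs of cycles and discards pairs with ΔQ_n = 0; both choices are assumptions made here and are not stated in [75].

```python
import random

def integrated_charge(t_set, t_pulse=1e-6, i_ap=20e-6, i_p=40e-6):
    """Charge Q = integral of i dt over a rectangular set pulse.

    The current is i_ap before the AP->P switch at t_set and i_p afterwards;
    if t_set >= t_pulse the device does not switch within the pulse.
    Current levels are hypothetical.
    """
    t_switch = min(t_set, t_pulse)
    return i_ap * t_switch + i_p * (t_pulse - t_switch)

def same_device_bits(n_pairs, mean_t_set, seed=2):
    """Bit = 1 if Q_n > Q_(n-1), 0 if Q_n < Q_(n-1); ties are discarded."""
    rng = random.Random(seed)  # pseudo-random stand-in for thermal switching
    bits = []
    for _ in range(n_pairs):
        q_prev = integrated_charge(rng.expovariate(1.0 / mean_t_set))
        q_curr = integrated_charge(rng.expovariate(1.0 / mean_t_set))
        delta_q = q_curr - q_prev
        if delta_q == 0:
            continue  # neither cycle switched within the pulse: no bit assigned
        bits.append(1 if delta_q > 0 else 0)
    return bits

# Two hypothetical mean set delays, e.g. two different bias or temperature conditions.
fast = same_device_bits(10000, 0.2e-6)
slow = same_device_bits(10000, 0.5e-6)
print("fraction of 1s (fast):", sum(fast) / len(fast))
print("fraction of 1s (slow):", sum(slow) / len(slow))
```

Because both charges in a pair come from the same distribution, the sign of their difference stays balanced even when the mean delay changes, which is the property that removes the need for a second matched device.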
Figure 4.22a shows the same TRNG concept applied to the case of a triangular
waveform. Positive and negative triangular pulses are applied for stochastic
set and deterministic reset, respectively. In this case, the stochastic switching is
evidenced by the different set and reset voltages in cycles n−1 and n, resulting in
different current waveforms during the two consecutive cycles. Figure 4.22b shows the
distribution of the integrated current over a single cycle, Q_n = ∫ i dt, while Fig. 4.22c
shows the difference ΔQ_n = Q_n − Q_{n−1} over two consecutive cycles, serving as the
stochastic variable for bit generation. In the TRNG concepts illustrated in Figs. 4.21

Fig. 4.22 a Measured triangular voltage pulses and current response for 2 consecutive cycles n−1
and n, b PDF of the integrated current Qn and c PDF of differential charge ΔQn = Qn − Qn−1 .
The pulse sequence includes positive and negative triangular pulses for stochastic set and reset
transitions, respectively, as evidenced by the abrupt steps in the current response. The random bit
is assigned from the value of ΔQn in (c). Reprinted with permission from [75]. Copyright (2018)
IEEE

and 4.22, the entropy source is either the stochastic distribution of switching time,
or the stochastic distribution of switching voltage, respectively [75].
Generally, TRNG concepts require a further whitening algorithm, such as the Von
Neumann correction [76] or the XOR operation [83], to achieve a truly unbiased
bitstream. However, the schemes of Figs. 4.21 and 4.22 can pass the standard tests of
the National Institute of Standards and Technology (NIST) [63] without any post-processing,
thus enabling a reduced energy and area overhead of the TRNG circuit
[75]. Figure 4.23 reports the pass rate for the nonoverlapping template test in the
NIST criteria as a function of pulse voltage for rectangular and triangular pulses. The
TRNG with rectangular pulses shows an acceptable pass rate only within
a narrow voltage window, with a randomness degradation at both higher and lower
voltages. On the other hand, the TRNG with triangular pulses shows a high pass rate
over the whole test range, demonstrating a high, voltage-independent randomness.
These results can be explained by considering the dependence on the applied voltage V_A
of the switching parameters tset and Vset (or treset and Vreset) for rectangular and
triangular pulses [75]. Considering a rectangular pulse, the set time tset can be written
as [85]:

$$t_{set} = \tau_0 \exp\left(\Delta\left(1 - \frac{V}{V_0}\right)\right), \qquad (4.1)$$

where V0 and τ0 are constants, V is the applied voltage, and Δ is the thermal stability
factor. Given the exponential dependence in (4.1), there is only a narrow window
of voltages where the switching time tset is comparable to the applied pulse width
(Fig. 4.21a). On the other hand, the set voltage under a triangular pulse, where the
applied voltage is ramped according to V(t) = 2V_A t/t_P, can be estimated from

Fig. 4.23 Pass rate of the nonoverlapping template NIST test as a function of pulse voltage for
rectangular and triangular pulses. The pass rate is referred to a total of 148 tests. Rectangular pulses
show an operation window around 0.6 V, whereas triangular pulses show voltage-independent high
randomness. Reprinted with permission from [75]. Copyright (2018) IEEE


the integrated switching probability reaching one, namely ∫ dt/tset = 1, with tset
defined by (4.1). Thus, the set voltage along a triangular pulse is given by [64, 82]:

$$V_{set} \approx V_0 \ln\frac{t_0 V_A}{V_0 t_P}, \qquad (4.2)$$

suggesting a logarithmic dependence of Vset on the maximum applied voltage V_A.
This explains the voltage-independent high entropy for the triangular pulse scheme
with respect to the rectangular pulse scheme in Fig. 4.23. Owing to this different dependence,
the time-based scheme (rectangular pulse) might still require some probability tracking
to find the correct V_A for optimal performance. In general, differential reading
schemes based on stochastic voltage look more promising than schemes
based on stochastic time thanks to a lower sensitivity to the external biasing. For
example, the application of an external magnetic field or a change in temperature
would only affect the switching threshold of the triangular pulse scheme, but not its
cycle-to-cycle variability, which acts as the entropy source. On the other hand, for
the rectangular pulse scheme, an external bias would change the voltage window for
maximum entropy, requiring a re-tuning of the applied voltage.
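
The narrow operating window of the rectangular pulse scheme follows directly from the exponential dependence in (4.1), as the sketch below suggests. The parameter values (τ0, Δ, V0 and the pulse width) are hypothetical and chosen only for illustration, and the switching probability within the pulse is computed by assuming an exponentially distributed switching time with mean tset, which is an added modeling assumption rather than a result from [75] or [85].

```python
import math

def t_set(v, tau0=1e-9, delta=40.0, v0=0.8):
    """Set time from Eq. (4.1): tset = tau0 * exp(delta * (1 - V / V0))."""
    return tau0 * math.exp(delta * (1.0 - v / v0))

def switch_probability(v, t_pulse=1e-6, tau0=1e-9, delta=40.0, v0=0.8):
    """Probability of switching within a rectangular pulse of width t_pulse.

    Assumes an exponential distribution of the switching time with mean tset.
    """
    return 1.0 - math.exp(-t_pulse / t_set(v, tau0, delta, v0))

for v in (0.50, 0.60, 0.66, 0.70, 0.80):
    print(f"V = {v:.2f} V   tset = {t_set(v):.2e} s   P(set) = {switch_probability(v):.3f}")
```

With these illustrative numbers the switching probability moves from nearly 0 to nearly 1 over a range of roughly 0.1 V, so only a small interval of V_A yields stochastic switching within the pulse.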

4.9 PUF Implementations

The RRAM device variability sources discussed for TRNG can in principle be
adopted for PUF systems, thus enabling a small area, low power consumption,
and high PUF performance in terms of uniqueness and reliability. For instance,
the stochastic resistance variation in RRAM was proposed for a reconfigurable PUF
[86]. Figure 4.24a shows the calculated lognormal distributions of RRAM resis-
tance for LRS and HRS. Figure 4.24b is a sketch of a PUF circuit consisting of
an RRAM array where each cell represents a single bit and can be initialized in
either LRS or HRS. The challenge consists of the addresses of two n-bit data words, while
the response is the bit-wise comparison of the RRAM resistances of the two words. In

Fig. 4.24 a Simulated resistance distributions for LRS and HRS, following normal and lognormal
distributions, respectively. b Schematic illustration of a PUF implementation exploiting RRAM
resistance variability. Reprinted with permission from [86]. Copyright (2014) IEEE

this PUF concept, the stochastic switching allows for the reconfiguration of the PUF
by reprogramming the RRAM array, in stark contrast with systems based on fixed
manufacturing variations. PUF reconfigurability significantly enhances security protocols
based on authentication [87], since it makes it possible to overcome the limitations due
to device degradation or a small CRP set. Figure 4.25 shows the characterization of the
PUF against three of the performance parameters in Sect. 4.2, namely, unpredictability,
unclonability, and reliability. First, the unpredictability of the PUF response can
be measured by studying the output bit uniformity. Figure 4.25a shows
“1” bias distributions of 256-bit responses, thus supporting a uniform output, also
confirmed by the almost equal probabilities of 3-bit responses in Fig. 4.25b. Second,
unclonability requires that the physical (or mathematical) CRP mapping cannot
be replicated, which in turn requires a strong uniqueness of the PUF to distinguish a
specific chip from another. This property can be assessed as the Hamming distance
(HD) between the responses of two different PUFs to the same challenge. It is also
referred to as the inter-chip HD (HDinter), which should ideally be 50%. Figure 4.25c
shows the calculated HDinter for 100 PUF samples of 256 kb RRAM arrays, demonstrating
an HDinter close to the ideal 50%. Finally, reliability refers to the ability of a
PUF to always give the same response to a given challenge. To evaluate the PUF
reliability, the intra-chip HD (HDintra) can be calculated among different
responses to the same challenge for the same PUF under different conditions (such
as temperature). The HDintra should be 0% for an ideal PUF, and a large separation
between HDinter and HDintra reduces the false identification rate [86]. HDintra might
be affected by the dependence of RRAM resistance on temperature and voltage.
For instance, Fig. 4.25d shows the resistance as a function of temperature for two

Fig. 4.25 a Distribution of the uniformity measured by “1” bias of a PUF implemented on a
256 kbit array. The relatively uniform output is demonstrated by the uniform occurrence of the
3-bit responses (b). c Uniqueness measured by HDinter distribution. d A resistance crossing event
between two different cells at increasing temperature, which causes a bit flipping and consequently
a reliability degradation. e Effect on HDintra distributions under different voltage fluctuations. f
HDintra distributions at different temperatures. Reprinted with permission from [86]. Copyright
(2014) IEEE

RRAM cells with two different activation energies [86]. Note the crossing between
the two resistance values at high temperature, which results in a bit flip and a consequent
reduction of the reliability. Figure 4.25e, f shows the impact of voltage

Fig. 4.26 a Schematic illustration of the resistive crosspoint array, which implements a strong PUF
by exploiting the sneak paths. b Distributions of cell current before and after the one-time program-
ming, showing quite large analog distribution. Reprinted with permission from [43]. Copyright
(2016) IEEE

and temperature on reliability, described by the parameter HDintra. In general, PUF
implementation with RRAMs requires that the spatial (i.e., cell-to-cell) variability
dominates over the temporal variability (i.e., noise) [86]. As a result, particular attention
should be paid to device retention properties to minimize possible aging effects that
might reduce the window between HDintra and HDinter. To develop a strong PUF, not
only the RRAM randomness and reliability, but also the circuit implementation of
the response function should be robust enough. Figure 4.26a shows a possible PUF
implementation based on a crosspoint RRAM array [43]. Here, the entropy source
is provided by the large analog resistance distribution of the RRAM. The current
sneak path is then exploited to go beyond the typical limitation of memory-based
PUFs, which have a limited set of CRPs. Note that for memory applications the sneak
path effect is detrimental to the cell read-out margin [88]. On the other hand, the sneak
path provides the unclonable function in this case, enabling an exponential scaling
of the CRP set, which is required for a strong PUF. In the N × N crosspoint PUF
of Fig. 4.26a, the challenge consists of an N-bit vector applied to the N rows, where
an input bit value of 1 corresponds to an applied voltage equal to V_DD, while the
row is left floating for a bit value of 0. The current from the N columns is then read
and converted to an N-bit response by a sense amplifier. Theoretically, the maximum
number of CRPs is 2^N, since each row may be either floating or biased.
The actual number of CRPs is smaller, since 50% of the rows are required
to be biased in order to generate a comparable range of column currents for different
challenges [43]. It is estimated that the CRP set is around 5 × 10^75 for an array of 256
× 256 bits. RRAM devices in the array are initialized only at the beginning of the
PUF operation, resulting in large cell current variability thanks to the variation in
switching dynamics (Fig. 4.26b).
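
The size of the CRP set can be checked with a short calculation. The sketch below assumes that the 50% constraint means exactly N/2 of the N rows are biased, so that the number of valid challenges is the binomial coefficient C(N, N/2); this reading is an interpretation made here, and the function names are hypothetical.

```python
import math

def crp_upper_bound(n_rows):
    """Unconstrained number of challenges: each row is either biased or floating."""
    return 2 ** n_rows

def crp_half_biased(n_rows):
    """Number of challenges with exactly half of the rows biased (assumed constraint)."""
    return math.comb(n_rows, n_rows // 2)

n = 256
print(f"2^N         = {float(crp_upper_bound(n)):.2e}")
print(f"C(N, N / 2) = {float(crp_half_biased(n)):.2e}")
```

For N = 256 the constrained count comes out between 5 × 10^75 and 6 × 10^75, consistent in order of magnitude with the estimate quoted above, and it still grows exponentially with N.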

Fig. 4.27 a Distributions of HDinter of 12-bit responses for 11 different input vectors. b Measured
read current for 12 columns as a function of time at T = 120 °C. c HDintra of 12-bit responses to
the same challenge as a function of time for three different temperatures T = 100, 120 and 140 °C.
Reprinted with permission from [43]. Copyright (2016) IEEE

The performance of the crosspoint PUF in Fig. 4.26 is evaluated in terms of
the experimental HDinter and HDintra. In particular, the uniqueness is evaluated by
HDinter, comparing the responses across 28 PUF instances. Figure 4.27a shows
HDinter distributions for 11 different challenges. The average HDinter ≈ 46.2% is
sufficiently close to the ideal 50%, thus demonstrating good uniqueness. In addition,
good PUF reliability requires sufficient retention of the array cell resistance
state. To this purpose, the output currents (i.e., the responses) were measured as a
function of time for increasing temperature. Figure 4.27b reports the results of an
annealing experiment at T = 120 °C as a function of time, underlining the RRAM
variation with time, as already demonstrated for HfOx RRAM [89]. The results are
summarized in Fig. 4.27c as HDintra for increasing temperature T = 100, 120 and
140 °C, showing an increasing value of HDintra, from 0 to 8%. Note that the HDintra
and HDinter distributions do not overlap, as the minimum for HDinter is around
17% (see Fig. 4.27a), demonstrating the feasibility of the crosspoint PUF concept
as a hardware security primitive. Embedding resistive devices in security primitives
allows for their hardware reconfigurability, which opens new possibilities for secret
key management. A key-based permission granting system requires eventual key
erasure after the permissions have been revoked. This system allows for logic locking
[90], which is used against intellectual property (IP) theft and circuit counterfeiting.
However, proving that the digital key has been erased is a difficult task. More
generally, a security protocol with erasable PUF responses is desirable [44].
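
The two figures of merit used in this evaluation are fractional Hamming distances: HDinter compares the responses of different PUF instances to the same challenge, while HDintra compares repeated responses of a single instance. A minimal sketch of both is given here; the function names and the toy 12-bit responses are illustrative assumptions and are not data from [43].

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of bit positions in which two equal-length responses differ."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def hd_inter(responses):
    """Mean pairwise distance between responses of different PUF instances."""
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def hd_intra(reference, repeated):
    """Mean distance between a reference response and later re-reads."""
    return sum(hamming_fraction(reference, r) for r in repeated) / len(repeated)


# Toy 12-bit responses (made-up values, for illustration only).
chip_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
chip_b = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1]
chip_a_aged = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # one bit drifted

uniqueness = hd_inter([chip_a, chip_b])               # ideal value is 0.5
reliability = hd_intra(chip_a, [chip_a, chip_a_aged])  # ideal value is 0
```

A usable PUF needs the largest HDintra to stay below the smallest HDinter, which is the non-overlap condition discussed in the text.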
Recently, a provable key destruction scheme based on memristive devices was
demonstrated [91] with a 128 × 64 Ta/HfO2 crosspoint array, shown in Fig. 4.28a. The
unclonable fingerprint is derived by comparing the conductance value of neighboring

Fig. 4.28 a Schematic of the crosspoint array enabling secure fingerprint extraction only after
provable key erasure, where the fingerprint is given by the comparison of LRS conductance between
two neighboring memristor cells. b Typical 128 × 32 fingerprint that can be generated from a 128 ×
64 memristor array. Reprinted with permission from Macmillan Publishers Ltd: Nature Electronics
[91]. Copyright (2018)

cell pairs in the array, after initializing all of them in the LRS. The random bit
identifying each pair is set to “1” if G_LRS,left ≥ G_LRS,right, and to “0” otherwise. Owing
to the intrinsic variability of the LRS, a random pattern (i.e., the fingerprint) is generated
to uniquely identify the device, as shown in Fig. 4.28b. Figure 4.29 shows the
experimental demonstration of provable key destruction. Here, an initial fingerprint
(FPchip, Fig. 4.29a) is generated and securely stored in a trusted database. Then, a
random key (Kchip, Fig. 4.29b) is written in the array, thus preventing the re-writing of
FPchip without losing Kchip. Kchip is also sent to the trusted party, so that it can be used
for unlocking features of the specific chip instance storing Kchip. When a key erasure
is necessary, the user simply reinitializes the array to the LRS, therefore destroying
the key Kchip and generating a new fingerprint (FP′chip, Fig. 4.29c), which constitutes
the demonstration of key erasure. The new fingerprint FP′chip is finally sent to the
trusted party for comparison with the previously stored FPchip. If the HD between
the two fingerprints is compatible with the expected distance between fingerprints of
the same chip, then the chip can be authenticated by the trusted party. In addition, the
trusted party also gets confirmation that Kchip has been erased, since erasure is required
for generating a valid FP′chip. The practical feasibility of the described concept is
demonstrated in Fig. 4.29d, showing that the distribution of HD for the same chip is
clearly separated from the distribution of HD for different chips. Figure 4.29e shows
the same distributions for 256-bit fingerprints, where the improved separation between
the two distributions supports the need for a large number of bits in the fingerprint.
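
The fingerprint extraction and the erasure check described in this paragraph can be sketched as follows. The sketch assumes non-overlapping pairs of adjacent columns (consistent with a 128 × 64 array yielding a 128 × 32 fingerprint) and a hypothetical acceptance threshold on the Hamming distance. The threshold value, the toy conductances and the function names are assumptions for illustration and do not come from [91].

```python
def fingerprint(conductance):
    """Turn LRS conductances (rows with an even number of cells) into bits.

    Each bit is 1 if the left cell of a pair conducts at least as well as
    the right cell, and 0 otherwise.
    """
    bits = []
    for row in conductance:
        if len(row) % 2 != 0:
            raise ValueError("each row needs an even number of cells")
        bits.append([1 if row[j] >= row[j + 1] else 0
                     for j in range(0, len(row), 2)])
    return bits


def hamming_fraction(fp_a, fp_b):
    """Fraction of differing bits between two fingerprints of equal shape."""
    flat_a = [bit for row in fp_a for bit in row]
    flat_b = [bit for row in fp_b for bit in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("fingerprints must have the same size")
    return sum(x != y for x, y in zip(flat_a, flat_b)) / len(flat_a)


def accept_erasure(fp_stored, fp_new, max_same_chip_hd):
    """True if the new fingerprint is close enough to the stored one.

    max_same_chip_hd is a hypothetical threshold chosen between the
    same-chip and different-chip HD distributions.
    """
    return hamming_fraction(fp_stored, fp_new) <= max_same_chip_hd


# Toy 2 x 4 array of LRS conductances in siemens (made-up values).
g_initial = [[120e-6, 100e-6, 90e-6, 95e-6],
             [80e-6, 80e-6, 70e-6, 110e-6]]
g_after_erase = [[118e-6, 101e-6, 96e-6, 94e-6],
                 [82e-6, 79e-6, 71e-6, 108e-6]]

fp_chip = fingerprint(g_initial)          # stored by the trusted party
fp_chip_new = fingerprint(g_after_erase)  # produced after key erasure
ok = accept_erasure(fp_chip, fp_chip_new, max_same_chip_hd=0.3)
```

With only four bits a single flipped pair already gives a distance of 0.25, which illustrates why longer fingerprints separate the same-chip and different-chip distributions more clearly.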

Fig. 4.29 a Initial fingerprint FPchip stored by the trusted party. b Digital key Kchip written in
the memristor array. c A second fingerprint FP′chip generated by the same array, thus destroying
the key. d HD distributions of 128-bit fingerprints from same chip and different chips, showing
sufficient separation, hence demonstrating the feasibility of the scheme. e The same comparison is
given for 256-bit fingerprints. Reprinted with permission from Macmillan Publishers Ltd: Nature
Electronics [91]. Copyright (2018)

4.10 Summary and Conclusions

The exponential increase of internet-based communication devices is raising the
demand for data/hardware security. A severe challenge is the limited area and power
for IoT devices, which spurs the research on low power, high-performance hardware
security blocks such as TRNG and PUF. While TRNGs are essential for encryption
adopted in data and transmission security, PUFs are becoming the preferred solution
for hardware authentication and verification.
The chapter provides an overview of TRNGs and PUFs based on emerging resis-
tive switching memory technology. We review the various schemes for using a
nanoscale device as entropy source, including stochastic noise, stochastic switch-
ing delay time and stochastic switching voltage. The various implementations are
discussed in terms of simplicity of the concept and stability over various operating
conditions, such as process, voltage, and temperature. The effectiveness of
differential schemes for TRNGs, which do not require any probability tracking to
tune the operating voltage and/or time, is also discussed and emphasized.
While the status of memory-based security primitives is already encouraging,
there are still many challenges toward a practical implementation of these concepts
in IoT and other integrated systems. In particular, device optimization needs to be
focused on high-frequency operation (>1 Gbit/s), low energy per bit (tens of fJ
range), aggressive area scalability (1× nm node) and infinite endurance. Most impor-
tantly, a CMOS-compatible technology is paramount for an easy integration capabil-
ity. The device should also be engineered toward enhancing the stochastic behavior,
which is generally unwanted and intentionally suppressed in memory applications.
A differentiation of the device geometry, materials, and operation algorithms toward
optimized random performance might be needed for TRNGs and PUFs. From the circuit
point of view, the research effort should focus on design solutions which minimize
the area, power, and circuit overhead. Clearly, this means that TRNG schemes
which do not require any post-processing algorithm or entropy-tracking feedback
should be preferred. In general, a thorough device/circuit co-design methodology is
extremely important and should be carefully explored. Finally, a fascinating direction
of research is the hardware reconfigurability, where the same fundamental structure
(e.g., a crosspoint memory array) is used for either memory, computing (e.g., as a
hardware primitive for stochastic/neuromorphic computing), or hardware security.
This offers new possibilities for ultra-small/low-power IoT devices, which would be
able to perform a wide range of tasks (e.g., pattern recognition and classification,
fast/low-power analog computation, authentication, etc.) within a single hardware
chip.

Acknowledgements This article has received funding from the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation programme (grant agreement
No. 648635).

References

1. J. Rajendran, R. Karri, J.B. Wendt, M. Potkonjak, N.R. McDonald, G.S. Rose, B.T. Wysocki,
Nanoelectronic solutions for hardware security. IACR Cryptol. ePrint Arch. 2012, 575 (2012)
2. C. Stergiou, K.E. Psannis, B.-G. Kim, B. Gupta, Secure integration of iot and cloud computing.
Futur. Gener. Comput. Syst. 78, 964–975 (2018)
3. C. Herder, M.-D. Yu, F. Koushanfar, S. Devadas, Physical unclonable functions and applica-
tions: a tutorial. Proc. IEEE 102(8), 1126–1141 (2014)
4. M.-W. Ryu, J. Kim, S.-S. Lee, M.-H. Song, Survey on internet of things. SmartCR 2(3), 195–
202 (2012)
5. K.-K.R. Choo, M.M. Kermani, R. Azarderakhsh, M. Govindarasu, Emerging embedded and
cyber physical system security challenges and innovations. IEEE Trans. Dependable Secur.
Comput. 3, 235–236 (2017)
6. F. Tehranipoor, Towards implementation of robust and low-cost security primitives for resource-
constrained iot devices (2018), arXiv:1806.05332
7. H. Nili, G.C. Adam, B. Hoskins, M. Prezioso, J. Kim, M.R. Mahmoodi, F.M. Bayat, O. Kavehei,
D.B. Strukov, Hardware-intrinsic security primitives enabled by analogue state and nonlinear
conductance variations in integrated memristors. Nat. Electron. 1(3), 197 (2018)
8. S.K. Mathew, S. Srinivasan, M.A. Anders, H. Kaul, S.K. Hsu, F. Sheikh, A. Agarwal, S.
Satpathy, R.K. Krishnamurthy, 2.4 Gbps, 7 mW all-digital PVT-variation tolerant true random
number generator for 45 nm CMOS high-performance microprocessors. IEEE J. Solid-State
Circuits 47(11), 2807–2821 (2012)

9. J. Katz, A.J. Menezes, P.C. Van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography
(CRC Press, 1996)
10. D. Ielmini, H.-S.P. Wong, In-memory computing with resistive switching devices. Nat. Elec-
tron. 1(6), 333 (2018)
11. J.J. Yang, D.B. Strukov, D.R. Stewart, Memristive devices for computing. Nat. Nanotechnol.
8(1), 13 (2013)
12. C.-H. Chang, Y. Zheng, L. Zhang, A retrospective and a look forward: fifteen years of physical
unclonable function advancement. IEEE Circuits Syst. Mag. 17(3), 32–62 (2017)
13. G.S. Rose, Security meets nanoelectronics for internet of things applications, in Proceedings
of the 26th Edition on Great Lakes Symposium on VLSI (ACM, 2016), pp. 181–183
14. S. Ghosh, Spintronics and security: prospects, vulnerabilities, attack models, and preventions.
Proc. IEEE 104(10), 1864–1893 (2016)
15. A. Alaghi, J.P. Hayes, Survey of stochastic computing. ACM Trans. Embed. Comput. Syst.
(TECS) 12(2s), 92 (2013)
16. J.S. Friedman, L.E. Calvet, P. Bessière, J. Droulez, D. Querlioz, Bayesian inference with Müller
C-elements. IEEE Trans. Circuits Syst. I: Regul. Pap. 63(6), 895–904 (2016)
17. W. Maass, Noise as a resource for computation and learning in networks of spiking neurons.
Proc. IEEE 102(5), 860–880 (2014)
18. P.A. Merolla, J.V. Arthur, R. Alvarez-Icaza, A.S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson,
N. Imam, C. Guo, Y. Nakamura et al., A million spiking-neuron integrated circuit with a scalable
communication network and interface. Science 345(6197), 668–673 (2014)
19. G. Pedretti, V. Milo, S. Ambrogio, R. Carboni, S. Bianchi, A. Calderoni, N. Ramaswamy, A.S.
Spinelli, D. Ielmini, Stochastic learning in neuromorphic hardware via spike timing dependent
plasticity with rram synapses. IEEE J. Emerg. Sel. Top. Circuits Syst. 8(1), 77–85 (2018)
20. G. Alvarez, S. Li, Some basic cryptographic requirements for chaos-based cryptosystems. Int.
J. Bifurc. Chaos 16(08), 2129–2151 (2006)
21. Maxim Integrated, Pseudo random number generation using linear feedback shift registers
(2010), Retrieved from Maxim Integrated website: http://www.maximintegrated.com/an4400
22. J. Von Neumann, Various techniques used in connection with random digits. Appl. Math. Ser.
12(36–38), 5 (1951)
23. J. Kelsey, B. Schneier, D. Wagner, C. Hall, Cryptanalytic attacks on pseudorandom number
generators, in International Workshop on Fast Software Encryption (Springer, 1998), pp. 168–
188
24. S. Chari, C. Jutla, J.R. Rao, P. Rohatgi, A cautionary note regarding evaluation of AES
candidates on smart-cards, in Second Advanced Encryption Standard Candidate Conference
(Citeseer, 1999), pp. 133–147
25. N. Gisin, G. Ribordy, W. Tittel, H. Zbinden, Quantum cryptography. Rev. Mod. Phys. 74(1),
145 (2002)
26. B. Jun, P. Kocher, The Intel random number generator. Cryptogr. Res. Inc. White Pap. 27, 1–8
(1999)
27. S. Sahay, M. Suri, Recent trends in hardware security exploiting hybrid cmos-resistive memory
circuits. Semicond. Sci. Technol. 32(12), 123001 (2017)
28. R. Brederlow, R. Prakash, C. Paulus, R. Thewes, A low-power true random number generator
using random telegraph noise of single oxide-traps, in IEEE International Solid-State Circuits
Conference, 2006. ISSCC 2006. Digest of Technical Papers (IEEE, 2006), pp. 1666–1675
29. C.-Y. Huang, W.C. Shen, Y.-H. Tseng, Y.-C. King, C.-J. Lin, A contact-resistive-random-
access-memory-based true-random-number generator. IEEE Electron Device Lett. 33(8), 1108
(2012)
30. A. Fukushima, T. Seki, K. Yakushiji, H. Kubota, H. Imamura, S. Yuasa, K. Ando, Spin dice:
a scalable truly random number generator based on spintronics. Appl. Phys. Express 7(8),
083001 (2014)
31. S. Chun, S.-B. Lee, M. Hara, W. Park, S.-J. Kim, High-density physical random number
generator using spin signals in multidomain ferromagnetic layer. Adv. Condens. Matter Phys.
(2015)

32. Z. Wei, Y. Katoh, S. Ogasahara, Y. Yoshimoto, K. Kawai, Y. Ikeda, K. Eriguchi, K. Ohmori, S.
Yoneda, True random number generator using current difference based on a fractional stochastic
model in 40-nm embedded ReRAM, in 2016 IEEE International Electron Devices Meeting
(IEDM) (IEEE, 2016), pp. 4–8
33. S. Balatti, S. Ambrogio, Z. Wang, D. Ielmini, True random number generation by variability
of resistive switching in oxide-based devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 5(2),
214–221 (2015)
34. S. Balatti, S. Ambrogio, R. Carboni, V. Milo, Z. Wang, A. Calderoni, N. Ramaswamy, D.
Ielmini, Physical unbiased generation of random numbers with coupled resistive switching
devices. IEEE Trans. Electron Devices 63(5), 2029–2035 (2016)
35. S. Zhou, W. Zhang, W. Nan-Jian, An ultra-low power CMOS random number generator. Solid-
State Electron. 52(2), 233–238 (2008)
36. E. Diehl, Ten Laws for Security (Springer, 2016)
37. J. Mathew, R.S. Chakraborty, D.P. Sahoo, Y. Yang, D.K. Pradhan, A novel memristor-based
hardware security primitive. ACM Trans. Embed. Comput. Syst. (TECS), 14(3), 60 (2015)
38. P. Kocher, J. Jaffe, B. Jun, Differential power analysis, in Annual International Cryptology
Conference (Springer, 1999), pp. 388–397
39. R. Pappu, B. Recht, J. Taylor, N. Gershenfeld, Physical one-way functions. Science 297(5589),
2026–2030 (2002)
40. M.-D. Yu, R. Sowell, A. Singh, D. M’Raïhi, S. Devadas, Performance metrics and empirical
results of a PUF cryptographic key generation ASIC, in 2012 IEEE International Symposium
on Hardware-Oriented Security and Trust (HOST) (IEEE, 2012), pp. 108–115
41. L. Zhang, Z.H. Kong, C.-H. Chang, A. Cabrini, G. Torelli, Exploiting process variations and
programming sensitivity of phase change memory for reconfigurable physical unclonable func-
tions. IEEE Trans. Inf. Forensics Secur. 9(6), 921–932 (2014)
42. D.E. Holcomb, W.P. Burleson, K. Fu, Power-up SRAM state as an identifying fingerprint and
source of true random numbers. IEEE Trans. Comput. 58(9), 1198–1210 (2009)
43. L. Gao, P.-Y. Chen, R. Liu, Y. Shimeng, Physical unclonable function exploiting sneak paths
in resistive cross-point array. IEEE Trans. Electron Devices 63(8), 3109–3115 (2016)
44. U. Rührmair, J. Sölter, F. Sehnke, X. Xiaolin, A. Mahmoud, V. Stoyanova, G. Dror, J. Schmid-
huber, W. Burleson, S. Devadas, PUF modeling attacks on simulated and silicon data. IEEE
Trans. Inf. Forensics Secur. 8(11), 1876–1891 (2013)
45. A. Vijayakumar, S. Kundu, A novel modeling attack resistant PUF design based on non-linear
voltage transfer characteristics, in Proceedings of the 2015 Design, Automation & Test in
Europe Conference & Exhibition (EDA Consortium, 2015), pp. 653–658
46. R. Waser, M. Aono, Nanoionics-based resistive switching memories. Nat. Mater. 6(11), 833
(2007)
47. H. Akinaga, H. Shima, Resistive random access memory (ReRAM) based on metal oxides.
Proc. IEEE 98(12), 2237–2251 (2010)
48. H.-S.P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F.T. Chen, M.-J. Tsai,
Metal-oxide RRAM. Proc. IEEE 100(6), 1951–1970 (2012)
49. D. Ielmini, Resistive switching memories based on metal oxides: mechanisms, reliability and
scaling. Semicond. Sci. Technol. 31(6), 063002 (2016)
50. S. Yu, H.-Y. Chen, B. Gao, J. Kang, H.-S.P. Wong, HfOx -based vertical resistive switching ran-
dom access memory suitable for bit-cost-effective three-dimensional cross-point architecture.
ACS Nano 7(3), 2320–2325 (2013)
51. H. Li, T.F. Wu, S. Mitra, H.-S.P. Wong, Resistive RAM-centric computing: design and modeling
methodology. IEEE Trans. Circuits Syst. I: Regul. Pap. 64(9), 2263–2273 (2017)
52. S.-G. Park, M.K. Yang, H. Ju, D.-J. Seong, J.M. Lee, E. Kim, S. Jung, L. Zhang, Y.C. Shin,
I.-G. Baek et al., A non-linear ReRAM cell with sub-1 μA ultralow operating current for high
density vertical resistive memory (VRRAM), in 2012 IEEE International Electron Devices
Meeting (IEDM) (IEEE, 2012), pp. 20.8.1–20.8.4
53. J.Y. Seok, S.J. Song, J.H. Yoon, K.J. Yoon, T.H. Park, D.E. Kwon, H. Lim, G.H. Kim, D.S.
Jeong, C.S. Hwang, A review of three-dimensional resistive switching cross-bar array memories
from the integration and materials property points of view. Adv. Funct. Mater. 24(34), 5316–
5339 (2014)
54. A. Bricalli, E. Ambrosi, M. Laudato, M. Maestro, R. Rodriguez, D. Ielmini, SiOx-based resistive
switching memory (RRAM) for crossbar storage/select elements with high on/off ratio, in
2016 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2016), pp. 4.3.1–4.3.4
55. D. Ielmini, Modeling the universal set/reset characteristics of bipolar RRAM by field-and
temperature-driven filament growth. IEEE Trans. Electron Devices 58(12), 4309–4317 (2011)
56. S. Larentis, F. Nardi, S. Balatti, D.C. Gilmer, D. Ielmini, Resistive switching by voltage-driven
ion migration in bipolar RRAM—part ii: modeling. IEEE Trans. Electron Devices 59(9), 2468–
2475 (2012)
57. S. Ambrogio, S. Balatti, A. Cubeta, A. Calderoni, N. Ramaswamy, D. Ielmini, Statistical fluc-
tuations in HfOx resistive-switching memory: part i-set/reset variability. IEEE Trans. Electron
Devices 61(8), 2912–2919 (2014)
58. S. Ambrogio, S. Balatti, A. Cubeta, A. Calderoni, N. Ramaswamy, D. Ielmini, Statistical
fluctuations in HfOx resistive-switching memory: part ii–random telegraph noise. IEEE Trans.
Electron Devices 61(8), 2920–2927 (2014)
59. S. Ambrogio, S. Balatti, V. McCaffrey, D.C. Wang, D. Ielmini, Noise-induced resistance broad-
ening in resistive switching memory—part i: intrinsic cell behavior. IEEE Trans. Electron
Devices 62(11), 3805–3811 (2015)
60. S. Ambrogio, S. Balatti, V. McCaffrey, D.C. Wang, D. Ielmini, Noise-induced resistance broad-
ening in resistive switching memory—part ii: array statistics. IEEE Trans. Electron Devices
62(11), 3812–3819 (2015)
61. D. Ielmini, F. Nardi, C. Cagli, Resistance-dependent amplitude of random telegraph-signal
noise in resistive switching memories. Appl. Phys. Lett. 96(5), 053503 (2010)
62. Y. Yoshimoto, Y. Katoh, S. Ogasahara, Z. Wei, K. Kouno, A ReRAM-based physically unclonable
function with bit error rate < 0.5% after 10 years at 125 °C for 40 nm embedded application,
in 2016 IEEE Symposium on VLSI Technology (IEEE, 2016), pp. 1–2
63. STS NIST, Special publication 800-22. A statistical test suite for random and pseudorandom
number generators for cryptographic applications (2010)
64. C. Cagli, F. Nardi, D. Ielmini, Modeling of set/reset operations in NiO-based resistive-switching
memory devices. IEEE Trans. Electron Devices 56(8), 1712–1720 (2009)
65. S.H. Jo, T. Chang, K.-H. Kim, S. Gaba, W. Lu, Experimental, modeling and simulation studies
of nanoscale resistance switching devices, in 9th IEEE Conference on Nanotechnology, 2009.
IEEE-NANO 2009 (IEEE, 2009), pp. 493–495
66. H. Jiang, D. Belkin, S.E. Savel’ev, S. Lin, Z. Wang, Y. Li, S. Joshi, R. Midya, C. Li, M. Rao
et al., A novel true random number generator based on a stochastic diffusive memristor. Nat.
Commun. 8(1), 882 (2017)
67. S. Gaba, P. Knag, Z. Zhang, W. Lu, Memristive devices for stochastic computing, in 2014 IEEE
International Symposium on Circuits and Systems (ISCAS) (IEEE, 2014), pp. 2592–2595
68. S. Gaba, P. Sheridan, J. Zhou, S. Choi, L. Wei, Stochastic memristive devices for computing
and neuromorphic applications. Nanoscale 5(13), 5872–5878 (2013)
69. S.H. Jo, K.-H. Kim, W. Lu, Programmable resistance switching in nanoscale two-terminal
devices. Nano Lett. 9(1), 496–500 (2008)
70. T. Ohno, T. Hasegawa, T. Tsuruoka, K. Terabe, J.K. Gimzewski, M. Aono, Short-term plasticity
and long-term potentiation mimicked in single inorganic synapses. Nat. Mater. 10(8), 591
(2011)
71. Z. Wang, S. Joshi, S.E. Savel’ev, H. Jiang, R. Midya, P. Lin, M. Hu, N. Ge, J.P. Strachan, Z. Li
et al., Memristors with diffusive dynamics as synaptic emulators for neuromorphic computing.
Nat. Mater. 16(1), 101 (2017)
72. A. Bricalli, E. Ambrosi, M. Laudato, M. Maestro, R. Rodriguez, D. Ielmini, Resistive switching
device technology based on silicon oxide for improved on-off ratio–part ii: select devices. IEEE
Trans. Electron Devices 65(1), 122–128 (2018)
73. R. Midya, Z. Wang, J. Zhang, S.E. Savel’ev, C. Li, M. Rao, M.H. Jang, S. Joshi, H. Jiang, P.
Lin et al., Anatomy of Ag/hafnia-based selectors with 10^10 nonlinearity. Adv. Mater. 29(12),
1604457 (2017)

74. S. Ambrogio, S. Balatti, S. Choi, D. Ielmini, Impact of the mechanical stress on switching
characteristics of electrochemical resistive memory. Adv. Mater. 26(23), 3885–3892 (2014)
75. R. Carboni, W. Chen, M. Siddik, J. Harms, A. Lyle, W. Kula, G. Sandhu, D. Ielmini, Random
number generation by differential read of stochastic switching in spin-transfer torque memory.
IEEE Electron Device Lett. (2018)
76. W.H. Choi, Y. Lv, J. Kim, A. Deshpande, G. Kang, J.-P. Wang, C.H. Kim, A magnetic tunnel
junction based true random number generator with conditional perturb and real-time output
probability tracking, in 2014 IEEE International Electron Devices Meeting (IEDM) (IEEE,
2014), pp. 12.5.1–12.5.4
77. A. Fantini, L. Goux, R. Degraeve, D.J. Wouters, N. Raghavan, G. Kar, A. Belmonte, Y.-Y.
Chen, B. Govoreanu, M. Jurczak, Intrinsic switching variability in HfO2 RRAM, in 2013 5th
IEEE International Memory Workshop (IMW) (IEEE, 2013), pp. 30–33
78. R. Carboni, S. Ambrogio, W. Chen, M. Siddik, J. Harms, A. Lyle, W. Kula, G. Sandhu, D.
Ielmini, Understanding cycling endurance in perpendicular spin-transfer torque (p-STT) mag-
netic memory, in 2016 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2016),
pp. 21.6.1–21.6.4
79. J.J. Nowak, R.P. Robertazzi, J.Z. Sun, G. Hu, J.-H. Park, J.H. Lee, A.J. Annunziata, G.P. Lauer,
R. Kothandaraman, E.J. O’Sullivan et al., Dependence of voltage and size on write error rates
in spin-transfer torque magnetic random-access memory. IEEE Magn. Lett. 7, 1–4 (2016)
80. D. Apalkov, B. Dieny, J.M. Slaughter, Magnetoresistive random access memory. Proc. IEEE
104(10), 1796–1830 (2016)
81. A.F. Vincent, N. Locatelli, J.-O. Klein, W.S. Zhao, S. Galdin-Retailleau, D. Querlioz, Analytical
macrospin modeling of the stochastic switching time of spin-transfer torque devices. IEEE
Trans. Electron Devices 62(1), 164–170 (2015)
82. Z. Li, S. Zhang, Thermally assisted magnetization reversal in the presence of a spin-transfer
torque. Phys. Rev. B 69(13), 134416 (2004)
83. D. Vodenicarevic, N. Locatelli, A. Mizrahi, J.S. Friedman, A.F. Vincent, M. Romera, A.
Fukushima, K. Yakushiji, H. Kubota, S. Yuasa et al., Low-energy truly random number genera-
tion with superparamagnetic tunnel junctions for unconventional computing. Phys. Rev. Appl.
8(5), 054045 (2017)
84. A. Mizrahi, N. Locatelli, R. Lebrun, V. Cros, A. Fukushima, H. Kubota, S. Yuasa, D. Quer-
lioz, J. Grollier, Controlling the phase locking of stochastic magnetic bits for ultra-low power
computation. Sci. Rep. 6, 30535 (2016)
85. R. Heindl, W.H. Rippard, S.E. Russek, M.R. Pufall, A.B. Kos, Validity of the thermal activation
model for spin-transfer torque switching in magnetic tunnel junctions. J. Appl. Phys. 109(7),
073910 (2011)
86. A. Chen, Utilizing the variability of resistive random access memory to implement reconfig-
urable physical unclonable functions. IEEE Electron Device Lett. 36(2), 138–140 (2015)
87. K. Kursawe, A.-R. Sadeghi, D. Schellekens, B. Skoric, P. Tuyls, Reconfigurable physical
unclonable functions-enabling technology for tamper-resistant storage (2009)
88. J. Zhou, K.-H. Kim, L. Wei, Crossbar rram arrays: selector device requirements during read
operation. IEEE Trans. Electron Devices 61(5), 1369–1376 (2014)
89. Y.Y. Chen, M. Komura, R. Degraeve, B. Govoreanu, L. Goux, A. Fantini, N. Raghavan, S.
Clima, L. Zhang, A. Belmonte, A. Redolfi, G.S. Kar, G. Groeseneken, D.J. Wouters, M. Jurczak,
Improvement of data retention in HfO2 /Hf 1T1R RRAM cell under low operating current
90. Y. Xie, A. Srivastava, Mitigating sat attack on logic locking, in Cryptographic Hardware and
Embedded Systems – CHES 2016, ed. by B. Gierlichs, A.Y. Poschmann (Springer, Berlin,
2016), pp. 127–146
91. H. Jiang, C. Li, R. Zhang, P. Yan, P. Lin, Y. Li, J.J. Yang, D. Holcomb, Q. Xia, A provable key
destruction scheme based on memristive crossbar arrays. Nat. Electron. 1(10), 548–554 (2018)
Chapter 5
Memristive Biosensors for Ultrasensitive
Diagnostics and Therapeutics

Ioulia Tzouvadaki, Giovanni De Micheli and Sandro Carrara

Abstract The coupling of the memristive effect with biological interactions results in
innovative nanobiosensors with high performance in both diagnostics and therapeutics.
Silicon nanowire arrays exhibiting a memristive electrical response are
acquired through a top-down nanofabrication process. Surface treatments imple-
menting sophisticated bio-functionalization strategies and adopting suitably selected
biological materials give rise to the memristive biosensors. The particular electrical
response of these novel biosensors leverages the modification of the hysteretic prop-
erties exhibited by the memristive effect before and after the bio-modification, to
achieve an efficient detection of biological processes. Memristive biosensors suc-
cessfully address the issue of the early detection of cancer biomarkers providing
a new technology for high performance, ultrasensitive, label-free electrochemical
sensing platforms. They also offer the capability of detecting extremely small traces
of cancer biomarkers, as well as effective screening and continuous monitoring of
therapeutic compounds in full human serum bringing novelty and solutions in the
medical practice, especially in the field of personalized medicine.

5.1 Challenges in Biosensing

Even today, medical devices still face several limitations concerning rapid,
reliable, and ultrasensitive sensing of biomarkers from a minimized volume of clinical
samples. In particular, cancer diagnosis usually involves uncomfortable medical
tests and long waiting times for the results of the medical assessment, while still
risking an uncertain medical outcome. Another very important aspect is the

I. Tzouvadaki (B) · G. De Micheli · S. Carrara
Integrated Systems Laboratory, EPFL, Lausanne, Switzerland

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_5

diagnosis of the disease at early stages, when suitable therapy decisions can be
made, giving a higher probability of treatment success at the beginning of the
disease. However, diagnostic tools still lack the level of resolution needed for the
detection of biomarkers at the early stages of the disease. Moreover, clinical
practice still lacks analytical methods for efficient, ultrasensitive monitoring of
therapeutic compounds. Reliable, low-cost, and accessible therapeutic compound
monitoring systems for individualized health care, and especially for the treatment
of malignant diseases such as cancer and AIDS, constitute a very important aspect
of medical practice. These requirements are even more pronounced for drugs with a
very narrow therapeutic window that also lies at low concentrations. Moreover,
different patients may respond differently to the very same dose of a drug, giving a
therapeutic response different from what is expected. Therefore, the realization of
novel ultrasensitive nanobiosensors for the direct and label-free detection of
chemical and biological species, which present high reliability, robustness, and the
advantage of quick data acquisition, may achieve optimum sensing output in both the
diagnostic and therapeutic fields, opening the way to early diagnostics and to
treatment with higher efficacy and fewer side effects for patients.
Nanostructure-based sensors are considered a highly promising strategy to
address the issues of sensitivity and limits of detection for both diagnostics and
therapeutics, and may allow the integration of the sensors in portable devices including
microfluidics and electronics for robust, flexible, and automated clinical applications.
Silicon (Si) nanowires, with unique properties such as a high surface-to-volume
ratio and a size comparable to biomolecules, combined with the specificity of
immune-sensing techniques, may provide optimum biosensing platforms [1].
In addition, although the memristive effect has already been introduced
in many different applications, very few of the implementations are dedicated to
bio-detection. Carrara et al. [2] demonstrated for the first time the potential use of
memristive effects in nanostructured devices for biosensing applications. Therefore,
the scope of memristive phenomena is expanded by coupling nanofabricated
devices that express memristive phenomena with biological processes,
introducing novelty and bringing new solutions to the biosensing field.

5.2 Nanofabricated Memristive Sensors for Bio-detection

Nanowire arrays are emerging as promising building blocks for miniaturized bioassays.
In the case of the memristive nanowires, the biosensing is based on the variations
of a voltage difference introduced in the semi-logarithmic current-to-voltage
characteristics upon the introduction of charged substances on the surface. The memristive
nanowires are realized using commercially available Silicon-on-Insulator wafers,
and the nanofabrication can be summarized in two electron-beam (e-beam) lithography
masks. The first e-beam lithography mask is dedicated to the definition of the
nanodevice electrodes. The electrodes are created through Nickel (Ni) evaporation,
liftoff, and annealing processes. The second e-beam lithography operation is
performed for the nanowire patterning, and then, as a last step, the nanowire structures
are etched through repeated Bosch process etching cycles of the upper Si. Overall,
this process results in suspended, vertically stacked, two-terminal, Schottky-barrier Si
nanowire arrays anchored between the two nickel silicide (NiSi) pillars (Fig. 5.1), for
devices designed with a smaller geometry, i.e., length of 420 nm and width of 35 nm,
and with a larger geometry, i.e., length of 980 nm and width of 90 nm (inset Fig. 5.1). The
particular electrical response of those memristive nanodevices provides a label-free
ultrasensitive bio-detection method. More specifically, the electrical characterization
of the nanodevices is performed by double sweeping the source-to-drain voltage
(Vds) at a fixed 0 V back-gate potential. One of the distinctive features of the electrical
response of these nanowires is the recorded hysteresis loop that is characteristic
of a memristive system (Fig. 5.2 top). In these nanodevices, the memory effect can be
attributed to the rearrangement of the charge carriers at the nanoscale due to external
perturbations [4], such as an applied voltage bias. For most bare nanowire devices,
this hysteresis appears fully pinched at zero voltage. In some other cases, the
hysteresis is pinched at voltage values very close to zero due to the impact of
environmental conditions, such as the ambient humidity, which introduces perturbations
to the conductivity of the device and greatly affects the memristive signals.

Fig. 5.1 SEM top and tilted view of the vertically stacked nanowire structures bridging NiSi source
and drain contacts (Reproduced with permission from [3]. Copyright 2016 American Chemical
Society)
[Image labels: Schottky barrier regions, SiO2, Source, Drain, Si-NW arrays]
Typically, the hysteresis in the memristive electrical characteristics is modified after
surface treatment of the nanodevice. The charged nature of the biological molecules
produces an effect on the nanodevice similar to that of an inorganic all-around gate in
the absence of any bio-functionalization [2]. The net charge of the biomolecules changes
the initial hysteresis and creates a sort of voltage memory. It appears as a voltage
difference, the so-called voltage gap, between the positions of the current minima in
the forward and backward branches, and is a further memory effect of the voltage scan
across the nanowire (Fig. 5.2 bottom) [2, 5]. More specifically, this voltage gap depends
on the kind and the concentration of the charged substances introduced on the device
surface, and it is very sensitive to the interplay of charges. Overall, the memristive
devices are bio-functionalized with receptor molecules to obtain memristive biosensors
and are then exposed to the target molecules. Ultrasensitive sensing is provided through
the variations of this voltage gap, which constitutes the main bio-detection parameter.

Fig. 5.2 Experimental electrical response obtained for bare two-terminal Schottky-barrier Si
nanowire arrays exhibiting memristive characteristics (memristive devices), top panels ("Memristive
nanowires electrical response"), and experimental electrical response after surface treatment
(memristive biosensors), bottom panels ("Memristive sensors electrical response"). Each row plots
I [A] and Log|I| [A] versus input voltage [V] over the range −2 V to 2 V. After surface treatment,
the pinched hysteresis and memristive characteristics are lost, giving rise to a voltage gap in the
semi-logarithmic electrical characteristics
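The definition of the voltage gap given above (the distance between the current minima of the
forward and backward branches of a double sweep) can be sketched in a few lines of Python. This is
only an illustrative sketch: the function names and the synthetic sinh-shaped branches are
assumptions made for the example, not the authors' analysis code or a device model.

```python
import math

def current_minimum_voltage(voltages, currents):
    """Return the voltage at which |I| is smallest within one sweep branch."""
    idx = min(range(len(currents)), key=lambda k: abs(currents[k]))
    return voltages[idx]

def voltage_gap(v_forward, i_forward, v_backward, i_backward):
    """Voltage gap: distance between the |I| minima of the two branches."""
    v_min_fwd = current_minimum_voltage(v_forward, i_forward)
    v_min_bwd = current_minimum_voltage(v_backward, i_backward)
    return abs(v_min_fwd - v_min_bwd)

# Synthetic double sweep from -2 V to +2 V and back (illustrative data only).
n = 401
v_fwd = [-2.0 + 4.0 * k / (n - 1) for k in range(n)]
v_bwd = list(reversed(v_fwd))

def toy_branch(v, v_zero):
    # Toy branch whose current crosses zero at v_zero (not a device model).
    return 1e-8 * math.sinh(v - v_zero)

i_fwd = [toy_branch(v, -0.20) for v in v_fwd]   # minimum of |I| near -0.20 V
i_bwd = [toy_branch(v, +0.22) for v in v_bwd]   # minimum of |I| near +0.22 V

gap = voltage_gap(v_fwd, i_fwd, v_bwd, i_bwd)
print(f"voltage gap = {gap:.2f} V")
```

On a semi-logarithmic plot the two minima of |I| appear as two dips; for a fully pinched
hysteresis both dips coincide and the function returns zero.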

5.2.1 Surface Treatments for Memristive Biosensors

To achieve effective and efficient bio-functionalization of the memristive
nanostructures, the following main phases must be performed (Fig. 5.3):

(a) Surface activation, where hydroxylation of the surface is achieved via exposure to
piranha solution (H2SO4–H2O2) or O2 plasma.
(b) Surface pretreatment, to enable optimum coupling of the receptor molecules on the
nanodevice surface. These surface treatments either leverage the high affinity between
biotin and streptavidin (affinity approach), rely on covalent binding through a silane
chosen as a linker molecule (covalent attachment), or rely on physisorption (direct
adsorption), which involves van der Waals forces, electrostatic interactions, or the
much stronger hydrophobic interactions.
(c) Receptor molecule immobilization, namely, of full-chain antibodies, antibody
fragments, or DNA aptamers, through adequate incubation in the receptor molecule
solution. Through this step, the memristive devices become memristive biosensors.
(d) Exposure to the target molecule, involving disease biomarkers and/or therapeutic
compounds.

Fig. 5.3 (i) AFM morphological analysis of nanowire arrays before (a) and after (b) the
bio-modification with an anti-Prostate Specific Antigen antibody. After bio-functionalization, a
clear change in the morphology can be seen and an agglomeration of biomolecules can be observed on
the surface of the nanodevices, masking the initial shape of the nanowires. (ii) Confocal microscopy
of nanofabricated structures before and after bio-functionalization: 3D fluorescence signal
distribution acquired using CLSM (wire arrays of width 150 nm and length 4.8 µm) before (control
sample) (a) and after (b) the bio-modification with FITC-conjugated antibodies. The bright regions
in the right image correspond to the accumulated biomolecules in the sample (Reference [6],
reproduced by permission of The Royal Society of Chemistry)

5.3 Sensing Performance of Memristive Biosensors

5.3.1 The Effect of Charged Residues: Sensing of Charged Polymeric Films¹

The modification of the electrical-conductivity hysteresis and the voltage gap variations
due to the presence of charged macromolecules were investigated and fully characterized
through the layer-by-layer deposition of charged polymeric films, i.e., through a
polyelectrolyte (PE) multilayer. PEs are linear macromolecular chains that bear a large
number of charged groups when dissolved in a suitable polar solvent. Among them, PSS
(poly(sodium 4-styrene sulfonate)) is a strong polyelectrolyte that is negatively charged
over a wide pH range, while PAH (poly(allylamine hydrochloride)) is a weak polyelectrolyte
that is positively charged in neutral or acidic solution [7]. Subsequent depositions
finally result in a PE multilayer stabilized by strong electrostatic forces [8].

The formation of PE multilayers is based on the consecutive adsorption of polyions with
alternating charge using the layer-by-layer (LBL) technique, as described by Chen and
McCarthy [9]. Here, the PE multilayer is formed by consecutive alternate adsorption of
positively charged PAH and negatively charged PSS from prepared PE solutions. The acquired
electrical characteristics (Fig. 5.4) give the average voltage gap value after deposition
of each PE layer for two different concentrations. The first electrical measurements were
performed after -OH treatment, which leads to the appearance of the voltage gap.
Afterward, the first PAH adsorption narrows the voltage gap (by 0.09 V for the 200 nM
concentration and by 0.16 V for 50 µM of PE). This change results from the change in the
charge density at the surface of the device due to the positively charged PAH. The effect
is even more pronounced at the higher PAH concentration: as more positive charges are
present on the surface, a larger change in the voltage gap is registered.

1 Source of original text [3].


Fig. 5.4 Formation of a multilayer of PEs by repeated electrostatic adsorption of oppositely charged
PE layers (schematic: alternating PAH and PSS layers on a surface bearing -OH groups); average
voltage gap value [V] versus number of polyelectrolyte layers (1–7), obtained from electrical
characterization of devices treated with layer-by-layer deposition of PEs for 200 nM (red points)
and 50 µM (black points) (Reproduced with permission from [3]. Copyright 2016 American Chemical
Society)

The adsorption of the negatively charged PSS shifts the average voltage gap back to a
higher value (0.17 V), and further treatment with PAH results in a new decrease of the
average voltage gap (0.15 V for the 200 nM concentration of PE). Further alternating
exchange of the PE solution therefore causes an alternating output signal, which slowly
decreases in amplitude. On the other hand, the consecutive adsorption of the same type of
PE (successive adsorption of PSS is presented in Fig. 5.4), tested with the highest
concentration of PE, results in a one-directional trend for the voltage gap, which
increases from 0.05 to 0.21 V. Given these interesting characteristics, the fabricated
memristive nanostructures are applied in the biosensing field, enabling the detection of
femtomolar and even attomolar concentrations. The interplay of the charges
(positive/negative) brought by the receptor/target molecules and the concentration of the
reagents (increasing/decreasing) defines the width of the voltage gap, which leads to an
ultrasensitive bio-detection method with considerable potential for novel biosensing.

5.3.2 Sensing Strategies

The effect demonstrated with the charged polyelectrolytes can be correlated to the
receptor/target molecule interplay. Indicatively, the effect observed for the alternating
introduction of oppositely charged PEs is similar to the voltage gap trend exhibited in
the previously reported case of binding between an antibody and a negatively charged
antigen (Fig. 5.5). Antibodies consist of long amino acid chains; under physiological
conditions (pH 7.4), arginine and lysine residues are positively charged, while aspartic
and glutamic acid residues are negatively charged. In an antibody, positively charged
residues are in excess with respect to negatively charged ones, even if the charge
distribution is quite similar [2]. On the contrary, antigens like PSA are negatively
charged. Therefore, when antibody–antigen binding occurs, an excess of negative charge
accumulates at the nanowire surface, and this excess increases with the antigen
concentration as the uptake of target molecules progresses. Aptamers, being
single-stranded RNA or DNA oligonucleotides, are considered negatively charged.
Therefore, for pairs of an aptamer and a negatively charged antigen or drug, a one-way,
increasing trend of the voltage gap is expected with increasing antigen/drug
concentration.

Fig. 5.5 Electrical response (current [A] versus voltage [V]) after bio-functionalization with
antibodies (black) and after antigen uptake at two different biomarker concentrations (blue and
red) (adapted from [2])
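The statement that arginine and lysine are positively charged and aspartic and glutamic acid are
negatively charged at pH 7.4 follows from the Henderson-Hasselbalch relation. The sketch below
illustrates this under stated assumptions: the side-chain pKa values are approximate textbook
numbers (they vary with the source and with the local protein environment), and the residue counts
are hypothetical and are not those of any antibody discussed in this chapter.

```python
def fraction_protonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a titratable group that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Approximate textbook side-chain pKa values (assumed for illustration).
SIDE_CHAIN_PKA = {"Arg": 12.5, "Lys": 10.5, "Asp": 3.9, "Glu": 4.1}
BASIC = {"Arg", "Lys"}  # protonated form carries +1; Asp/Glu carry -1 when deprotonated

def mean_charge(residue, ph):
    """Average side-chain charge of one residue at the given pH."""
    f = fraction_protonated(SIDE_CHAIN_PKA[residue], ph)
    return f if residue in BASIC else f - 1.0

def net_side_chain_charge(counts, ph=7.4):
    """Net charge contributed by the listed ionizable side chains."""
    return sum(n * mean_charge(res, ph) for res, n in counts.items())

for res in SIDE_CHAIN_PKA:
    print(f"{res}: {mean_charge(res, 7.4):+.3f}")

# Hypothetical residue counts with basic residues in slight excess.
example_counts = {"Arg": 40, "Lys": 90, "Asp": 55, "Glu": 65}
print(f"net side-chain charge: {net_side_chain_charge(example_counts):+.1f}")
```

Because pH 7.4 lies far from all four pKa values, each residue is almost fully ionized, so the sign
of the net charge is set essentially by the excess of basic over acidic residues.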

5.3.3 Factors Affecting the Memristive Biosensors’ Performance

5.3.3.1 The In-Dry Measurement Concept

The Debye screening length determines the extent of the space-charge region near a
discontinuity, here between the sensor surface and the analyte, and it must commonly be
taken into account when measurements are performed in liquid. The surface charges of
biomolecules in a buffer solution are shielded by oppositely charged buffer ions in the
solution, the so-called counterions. Debye screening may therefore mask the sensing
outcome in some cases, for instance, for extremely low sample concentrations (Debye
screening limitation). For this reason, measurements involving the memristive biosensors
are performed not in liquid but in air [5], following a novel paradigm of detection in
dry conditions under controlled relative humidity. The sample is thoroughly dried after
exposure to the target reagent, and only an ultrathin layer of water, formed by the
ambient humidity, is present in close proximity to the nanowire surface. Although the
sensors are dried after the bio-modification and cleaning steps, the nanosensor surface
is therefore never completely dry, which allows the proteins to function properly and
the probe-target molecule system to interact stably [10]. Since the electrical
characterization is performed in dry conditions, Debye layer formation is negligible and
the setup operates in the framework of the surface and Stern layers, namely, at planes
before the slipping plane. In addition, since the Debye length is negligible, the zeta
potential is negligible as well; therefore, the potential of interest in the suggested
setup is the surface potential and its variations.
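To give a sense of scale for the Debye screening limitation in liquid, the standard expression for
the Debye length of an electrolyte, lambda_D = sqrt(eps_r eps_0 k_B T / (2 N_A e^2 I)), can be
evaluated numerically. The sketch below is only illustrative: the ionic strengths, the relative
permittivity of water (78.5), and the temperature (25 °C) are assumed example values and are not
taken from the measurements described in this chapter.

```python
import math

K_B = 1.380649e-23        # Boltzmann constant, J/K
Q_E = 1.602176634e-19     # elementary charge, C
N_A = 6.02214076e23       # Avogadro constant, 1/mol
EPS0 = 8.8541878128e-12   # vacuum permittivity, F/m

def debye_length_nm(ionic_strength_molar, eps_r=78.5, temperature_k=298.15):
    """Debye length (nm) of an electrolyte with the given ionic strength (mol/L)."""
    ionic_strength_m3 = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    lam = math.sqrt(eps_r * EPS0 * K_B * temperature_k
                    / (2.0 * N_A * Q_E ** 2 * ionic_strength_m3))
    return lam * 1e9

# Assumed example ionic strengths: roughly physiological, then 10x and 100x diluted.
for ionic_strength in (0.15, 0.015, 0.0015):
    print(f"I = {ionic_strength:g} M -> Debye length ~ "
          f"{debye_length_nm(ionic_strength):.2f} nm")
```

At an ionic strength of about 0.15 M the screening length is below 1 nm, so charges located farther
from the surface are largely screened; diluting the buffer a hundredfold lengthens it only tenfold.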

5.3.3.2 The Role of Ambient Humidity

Environmental humidity conditions act on the electrical response of the nanodevice,
affecting to a great extent the memristive signals and the obtained hysteresis [6, 11].
The higher the humidity in the treatment area, the more hydroxyl groups are introduced
onto the surface of the sensor, which perturbs the conductivity of the device channel.
It was observed that for low relative humidity (rH%), the voltage gap of bare devices is
zero or close to zero. Increasing the rH% introduces a small voltage difference between
the forward and the backward regimes, because water molecules originating from the
environmental humidity adsorb and accumulate on the nanowire surface and finally form a
thin liquid film there. The charges of the water molecules act on the virtual gate
voltage similarly to those of charged chemical and biological species, affecting the
memristive behavior of the nanodevices. This implies that an ideally pinched hysteresis,
perfectly matching the theory, would be obtained from measurements performed under ideal
conditions such as high vacuum. It is worth noting that above 45% rH, the voltage gap is
almost constant and the system appears to be saturated (Fig. 5.6). The voltage gap values
of bare devices are in the range of 0–0.16 V, demonstrating the stability of the device
with respect to the humid environment prior to any modification.

5.3.3.3 The Size of Bio-recognition Element

An interesting relationship was found between the size of the bio-recognition element
used and the voltage gap measured after bio-functionalization. More specifically, the
electrical performance of the biosensors was investigated in terms of the hysteresis
modification for different bio-functionalization reagents, namely, full-size
immunoglobulin antibodies (IgG), antibody fragments (scAb), and DNA aptamers. The
full-size antibodies and the antibody fragments have different structures and different
sizes. The size of the bio-functionalization reagent affects the hysteresis modification,
namely, the voltage gap obtained (Fig. 5.7). The voltage gap appears proportional to the
antibody size, which can be interpreted in terms of the net positive charge accumulated
on the nanodevice surface: the value of the virtual bio-gate voltage increases with the
size of the linked antibody, as the net charge introduced on the nanodevice increases.
Further support is provided by the voltage gap value of a memristive biosensor based on
DNA aptamers, which have an average molecular weight of 15 kDa. Overall, a direct
relationship between the size of the bio-recognition element and the voltage gap value is
demonstrated. This shows the potential for design flexibility and compatibility with
respect to the target molecules and the desired implementation, and opens new
possibilities for the fabrication of application-oriented memristive biosensors.

Fig. 5.6 Memristive device rH% calibration: average voltage gap value [V] versus relative humidity
(rH), exhibited by non-bio-modified nanostructures just after the fabrication process and tested
under different relative humidity conditions (rH%). The plot marks a region of high humidity impact
followed by saturation

Fig. 5.7 Voltage gap dependence upon the bio-functionalization reagent: anti-PSA DNA aptamers
(≈15 kDa), anti-PSA scAb (≈42 kDa), and full-size anti-PSA IgG antibody (≈150 kDa) have different
sizes and therefore correspond to different voltage gap values, resulting in a linear trend for the
relation between voltage gap and reagent size [12]

5.4 Modeling Memristive Biosensors

5.4.1 Equivalent Circuit Model Based on Memristors

Because of the wide range of applications that memristor devices may offer, several
efforts have been made to study memristive behavior experimentally as well as
computationally [13–20]. Besides theoretical and experimental studies, models that
approximate the physical realization well are needed. In this framework, it is worth
mentioning the simple compact model of the electrical behavior of memristors introduced
by Biolek et al. [21], a mathematical SPICE model of the memristor prototype manufactured
in 2008 at Hewlett-Packard (HP) Labs [13]. Furthermore, Benderli [22] suggested a
macromodel that simulates the electrical behavior of thin-film titanium dioxide (TiO2)
memristors. Last but not least, Rak et al. [23] created a memristor element in SPICE that
simulates the memristor realization published by HP Labs and can be used as a circuit
element in design work. For the computational study of the memristive biosensors, a
macromodel of a memristor element was created and combined with analog circuit elements.
The resulting equivalent circuit models successfully emulated the behavior of the
physical system and fitted the experimental results of memristive biosensors in good
approximation [24]. Through simulations and fitting between the experimental and
computational outcomes, it was found that the electrical characteristics obtained from
experimental measurements exhibit hysteretic properties attributable to memristive
devices. This validates the hypothesis that the experimental setup exhibits memristive
behavior and confirms the memristive nature of the physical system. In addition, the
voltage gap appearing in the current-to-voltage characteristics of nanowires with a
bio-modified surface was successfully reproduced computationally and was related to
capacitive effects due to minority carriers in the nanowire [24]. It was also indicated
that those effects are strongly affected by the concentration of biomolecules taken up on
the device surface.
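As a rough illustration of why a memristor produces a hysteresis loop pinched at the origin, the
sketch below integrates the linear ion-drift picture commonly associated with the HP TiO2 device
[13] under a sinusoidal drive. It is a minimal sketch under stated assumptions, not the SPICE
models of [21–23] nor the macromodel of [24]: the window function used in SPICE models is replaced
here by hard clipping of the state variable, and all parameter values are illustrative.

```python
import math

def simulate_memristor(v_amp=1.0, freq=1.0, r_on=100.0, r_off=16e3,
                       d=10e-9, mu=1e-14, x0=0.1, steps=24000, cycles=1):
    """Forward-Euler integration of a linear ion-drift memristor.

    State x = w/D lies in [0, 1]; memristance M(x) = r_on*x + r_off*(1 - x);
    state equation dx/dt = mu * r_on / d**2 * i(t).
    Returns a list of (voltage, current, memristance) samples.
    """
    dt = cycles / (freq * steps)
    k = mu * r_on / d ** 2
    x = x0
    trace = []
    for n in range(steps + 1):
        t = n * dt
        v = v_amp * math.sin(2.0 * math.pi * freq * t)
        m = r_on * x + r_off * (1.0 - x)
        i = v / m
        trace.append((v, i, m))
        x = min(1.0, max(0.0, x + k * i * dt))  # hard clip instead of a window function
    return trace

trace = simulate_memristor()
v_up, i_up, _ = trace[2000]      # 0.5 V on the rising part of the positive half-cycle
v_down, i_down, _ = trace[10000]  # 0.5 V again on the falling part
print(f"rising:  V = {v_up:.3f} V, I = {i_up:.3e} A")
print(f"falling: V = {v_down:.3f} V, I = {i_down:.3e} A")
```

The same voltage gives two different currents depending on the sweep history, while the current is
zero whenever the voltage is zero, which is the pinched hysteresis described in the text.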

5.4.2 Bare Si Nanowire Devices: Memristive Devices

According to previous works in the literature [25–28], Si nanowire FETs with Schottky
source and drain contacts can be modeled as metal–semiconductor–metal (M-S-M) structures
with finite Schottky-barrier heights. The modeling is based on equivalent circuits in
which Schottky diodes represent the metal–semiconductor contacts and the nanowire is
treated as a resistor. Lee et al. [27] developed an equivalent circuit model consisting
of one reverse-biased Schottky diode, one resistor, and one forward-biased Schottky diode
connected in series. The intrinsic nanowire channel is thus modeled as one linear
resistor, and the gate voltage dependence of the nanowire is not included. In addition, a
Si nanowire FET model has been introduced [25] that is based on an equivalent circuit
consisting of two Schottky diodes for the M-S contacts and one MOSFET for the intrinsic
Si nanowire FET. Furthermore, Elhadidy et al. [26] modeled the symmetrical, nonlinear
current-to-voltage characteristics of a metal–semiconductor–metal structure with two
metallic Schottky contacts on a p-type semiconductor by treating the semiconductor as a
resistor sandwiched between two identical head-to-head Schottky barriers. Each of the two
Schottky barriers is modeled as a sub-circuit consisting of a diode in parallel to a
resistor (Fig. 5.8).

For the memristive nanowires, an equivalent circuit model was developed following the
concept introduced by Elhadidy et al. [26]; it consists of a memristor sandwiched between
two identical head-to-head Schottky barriers [24]. The Schottky barriers are represented
by (RD) sub-circuits, each consisting of a diode in parallel to a resistor. They slightly
modify the memristive curve, giving the branches the typical Schottky-contact shape
without affecting the location of the current minima. A unique current corresponds to
each applied voltage. If the polarity of the bias voltage is reversed, the reverse-biased
barrier and the forward-biased barrier exchange roles. For consistency, the input values
of the sinusoidal voltage source (Vin) were the same as in the case of the pure memristor
equivalent circuit.

The experimental semi-logarithmic current-to-voltage characteristics show a noticeable
asymmetry between the branches. Under ideal circumstances, both branches of the
semi-logarithmic current-to-voltage curve would be symmetrical, since the Schottky
barriers of the device structure are considered identical. Nevertheless, the data
measured in real experimental conditions indicate non-identical branches for the majority
of the devices under study. This slight asymmetry may be explained by the non-identical
contact areas occurring in real conditions, mainly due to the presence of different
interfacial insulating layers at the two electrode contacts. To emulate this asymmetry,
and considering that one diode does not conduct during one half-cycle of the voltage, as
mentioned before, the equivalent circuit can in this specific case be simplified by
replacing one of the two autonomous (RD) sub-circuits with a resistor. The non-identical
Schottky barriers are thus taken into account in the equivalent circuit through the
equivalent resistor: during one half-cycle of the current (depending on the polarity of
the bias voltage), one diode does not conduct because it is reverse-biased, and
consequently only the remaining resistance originating from the reverse-biased Schottky
diode contributes. It was demonstrated that the simulation results follow in good
approximation the average behavior of the physical system and yield a current-to-voltage
characteristic equivalent to that of a memristor device electrically contacted by two
asymmetric Schottky barriers, validating the hypothesis that the experimental setup
exhibits memristive behavior.

Fig. 5.8 (a) Equivalent circuit of a memristor (M) sandwiched between two non-identical head-to-head
Schottky barriers, driven by the input voltage source Vin. The sub-circuits consisting of a diode
(D) in parallel to a resistor (R) emulate the effect of the Schottky barriers. The circuit consists
of resistances in the range of 0.5–1 kΩ and common Si epitaxial planar fast-switching diodes
provided by SPICE. (b) Semi-logarithmic current-to-voltage results (Log|I| [A] versus input
voltage [V]) obtained from the equivalent circuit, compared with experimental results from
electrical measurements of a bare memristive device. The simulation current is scaled to the
experimental current range and the input voltage range is [−3, 3] V. Since the devices by nature
usually behave non-identically, multiple experimental curves are presented. The computationally
obtained results follow in good approximation the average behavior of the physical system, and it
can be concluded that the experimental setup exhibits memristive behavior (adapted from [24])
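The effect of identical versus non-identical head-to-head (RD) sub-circuits on the symmetry of the
branches can be illustrated with a static calculation. The sketch below is a simplification under
stated assumptions and is not the SPICE circuit of [24]: the memristor is replaced by a fixed
resistance r_m, the diodes follow the ideal Shockley equation, and all parameter values
(saturation currents, resistances, r_m) are hypothetical.

```python
import math

VT = 0.02585  # thermal voltage near 300 K, in volts

def subcircuit_current(vd, orientation, i_sat, r_par, n=1.0):
    """Current through a diode (orientation +1 or -1) in parallel with r_par."""
    arg = min(orientation * vd / (n * VT), 700.0)  # guard against overflow
    return orientation * i_sat * (math.exp(arg) - 1.0) + vd / r_par

def bisect(f, lo, hi, iters=80):
    """Root of a monotonically increasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def subcircuit_voltage(i, orientation, i_sat, r_par):
    """Voltage drop across one RD sub-circuit carrying current i."""
    bound = abs(i) * r_par  # the resistor alone would drop this much
    return bisect(lambda vd: subcircuit_current(vd, orientation, i_sat, r_par) - i,
                  -bound, bound)

def series_current(v_in, r_m, left, right):
    """Current through RD(left) - r_m - RD(right); left/right = (i_sat, r_par)."""
    def mismatch(i):
        v1 = subcircuit_voltage(i, +1, *left)
        v2 = subcircuit_voltage(i, -1, *right)   # head-to-head: opposite orientation
        return v1 + i * r_m + v2 - v_in
    bound = abs(v_in) / r_m
    return bisect(mismatch, -bound, bound)

R_M = 10e3                                   # hypothetical fixed "memristor" resistance
identical = ((1e-6, 1e3), (1e-6, 1e3))       # two identical sub-circuits
non_identical = ((1e-6, 1e3), (1e-8, 0.5e3)) # hypothetical mismatch between contacts

for name, (left, right) in (("identical", identical), ("non-identical", non_identical)):
    i_pos = series_current(+2.0, R_M, left, right)
    i_neg = series_current(-2.0, R_M, left, right)
    print(f"{name}: |I(+2 V)| = {abs(i_pos):.4e} A, |I(-2 V)| = {abs(i_neg):.4e} A")
```

With identical sub-circuits the two polarities give the same current magnitude, whereas a mismatch
between the two contacts makes the branches unequal, in line with the asymmetry discussed above.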

5.4.3 Bio-functionalized Silicon Nanowire Devices: Memristive Biosensor

According to the literature [29], a Schottky diode can also be described by an equivalent
circuit model consisting of a nonlinear capacitor in parallel to a nonlinear resistor.
The capacitor stands for the space charge capacitance and reflects only the free carriers
of the material, while the resistor represents the residual conductance of the diode. In
lightly doped materials, the free-carrier concentration can become comparable to the deep
level concentration; in this case, the charging and recharging of deep levels also
contribute considerably to the measured capacitance. In a Schottky barrier, the barrier
is high enough that there is a depletion region in the semiconductor near the interface.
In this depletion region, dopants remain ionized and give rise to the "space charge",
which, in turn, gives rise to a junction capacitance. The metal–semiconductor interface
and the opposite boundary of the depleted area act like two capacitor plates, with the
depletion region acting as the dielectric. The junction capacitance depends on the
applied terminal voltages. When a voltage is applied to the junction, the width of the
space charge layer shifts and the space charge within the depletion region varies, since
additional defect centers are ionized; as a result, the capacitance also changes.
Furthermore, the charging and recharging of the trap levels during a measurement cycle
periodically change the Schottky-barrier height, and the modified measurement current
gives a capacitive contribution to the diode admittance. Thus, both effects, the
variation of the bias and the consequent ionization of traps, change the junction
capacitance [30].

An equivalent circuit containing nonlinear (RC) sub-circuits, each consisting of a
nonlinear capacitor in parallel to a nonlinear resistor, was then introduced (Fig. 5.9a)
in order to correctly model the appearance of the voltage gap for memristive biosensors.
The sub-circuits were connected in series with the memristor (M) of the initial
equivalent circuit. It was demonstrated (Fig. 5.9b) that the two current minima are
clearly separated and that a voltage gap appears in the semi-logarithmic
current-to-voltage characteristics due to the capacitors now introduced. The fit of the
simulation results to the experimental data confirms that the voltage gap appearing in
the experimental current-to-voltage characteristics of the memristive biosensors was
reproduced computationally in very good approximation.

Measurement of the junction capacitance is a very useful technique, giving information on
Schottky-barrier heights, dopant profiles, and the presence of traps and defects inside
the semiconductor and at the interface [31]. Several works on relevant measurements
[29–32] report values of the capacitance in the junction area, which refers to the space
charge capacitance, ranging from pF [29, 30] up to nF [31, 32]. According to the
literature, the excess capacitance results from the parallel combination of the space
charge capacitance characterizing the diode and the diffusion capacitance due to the
injection of minority carriers. The reported values of the excess capacitance reach
43 nF, and it is considered that the excess capacitance mainly originates from the bulk
Si rather than from the interface of the diode under study [32], while typical
capacitance values due to the depletion area alone are in the range of pF [29, 30]. It
is worth mentioning that all the values found for the equivalent capacitance fit quite
well the values reported in the literature for the excess capacitance. This indicates
that the presence of antibodies, and thereafter of antigens, on the memristive biosensor
interacts deeply with the conductivity of the channel related to minority carriers. The
width of the simulated voltage gap can be modulated by varying the value of the
capacitance introduced in the circuit. More specifically, the two local minima converge
or move away from each other when the input value of the capacitance in the equivalent
circuit is modified. Experimental observations [2] identify a similar modification of the
voltage gap with respect to the type and the concentration of the biological molecules
taken up on the device surface, as well as an enlargement of the hysteresis window due to
the presence of charged molecules, i.e., the antibodies, around the freestanding channel
after the bio-functionalization process. The accumulated data suggest that the maximum
voltage gap observed is approximately 1 V.

Fig. 5.9 (a) Equivalent circuit for memristive biosensors consisting of a memristor (M) and
nonlinear sub-circuits (RC), driven by the input voltage source Vin. Each sub-circuit consists of a
nonlinear capacitor (C) in parallel to a nonlinear resistor (Rc). (b) Semi-logarithmic
current-to-voltage simulation results (red curve; Log|I| [A] versus input voltage [V]) obtained from
the equivalent circuit, compared with experimental results from electrical measurements of a
memristive biosensor, namely, the nanofabricated device after bio-functionalization with
antibodies, for three voltage sweeps (green curves). The simulation current is scaled to the
experimental current range. The input voltage range is [−3, 3] V and the resistance value is
0.85 kΩ (adapted from [24])
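The parallel-plate picture of the depletion region described in this section can be made
quantitative with the textbook abrupt-junction expressions W = sqrt(2 eps (V_bi − V) / (q N)) and
C/A = eps / W. The sketch below only illustrates the bias dependence of the space charge
capacitance: the doping level and built-in potential are hypothetical, and the expressions ignore
the deep levels, traps, and minority-carrier (excess) capacitance discussed above.

```python
import math

Q = 1.602176634e-19       # elementary charge, C
EPS0 = 8.8541878128e-12   # vacuum permittivity, F/m
EPS_SI = 11.7             # relative permittivity of Si

def depletion_width(n_dopant_m3, v_bi, v_applied):
    """Depletion width (m) of an abrupt Schottky junction; requires v_applied < v_bi."""
    return math.sqrt(2.0 * EPS_SI * EPS0 * (v_bi - v_applied) / (Q * n_dopant_m3))

def capacitance_per_area(n_dopant_m3, v_bi, v_applied):
    """Space charge capacitance per unit area (F/m^2): C/A = eps / W."""
    return EPS_SI * EPS0 / depletion_width(n_dopant_m3, v_bi, v_applied)

N_DOPANT = 1e21   # hypothetical doping, 1e15 cm^-3 expressed in m^-3
V_BI = 0.6        # hypothetical built-in potential, V

for v in (0.0, -1.0, -2.0):   # zero bias and two reverse biases
    w_um = depletion_width(N_DOPANT, V_BI, v) * 1e6
    c_nf_cm2 = capacitance_per_area(N_DOPANT, V_BI, v) * 1e9 / 1e4
    print(f"V = {v:+.1f} V: W = {w_um:.2f} um, C/A = {c_nf_cm2:.2f} nF/cm^2")
```

Increasing the reverse bias widens the depletion region and lowers the capacitance, which is the
voltage dependence that the nonlinear capacitor of the (RC) sub-circuit stands for.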

5.4.4 Antigen Uptake

The presence of antigens on the device surface seems to demonstrate the opposite
effects comparing to those resulting due to the presence of the antibodies. Antigens
are considered to have a masking contribution to the presence of antibodies all around
the device and decrease the positive charge effect due to the presence of antibodies
after the bio-functionalization process. Thus, the uptake of antigens acts by decreas-
ing the value of the positive all around gate bias voltage created by the presence of
antibodies. According to the previous arguments, the presence of antigens also affects

Table 5.1 Voltage gap values obtained experimentally for different antigen concentrations and
computationally estimated voltage gap values for different values of capacitance, selected according
to the experimental data. For 0 fM concentration of antigens, it is considered that the voltage gap
that appears is created only due to the bio-functionalization with antibodies
Antigen concentration Voltage gap (V) Capacitance (nF) Voltage gap (V)
(fM) -experimental- [2] -simulation-
0 0.84 36 0.844
5 0.56 24 0.563
10 0.37 15 0.362
148 I. Tzouvadaki et al.

the width of the voltage gap, which is already created by the presence of antibodies
all around the device after bio-functionalization. Collectively, the experimental data
show a contraction of the hysteresis window with increasing antigen concentration.
To further define the role of the capacitance value in the voltage gap, the
experimentally obtained voltage gap values for different antigen concentrations [2]
were taken into consideration, and different capacitance values were introduced into
the aforementioned equivalent circuit (Fig. 5.9a), designed for simulating the
modified memristive behavior, in order to reproduce computationally the voltage gap
values obtained experimentally, as reported in Table 5.1. Furthermore, the calibration
curve (Fig. 5.10) shows the computationally estimated voltage gap values that match
the voltage gap values obtained experimentally [2] for different antigen
concentrations. The computationally obtained voltage gap values result from the
different equivalent capacitance values introduced into the equivalent circuit, and
these capacitance values are found to correspond to the values reported in the
literature for the excess capacitance [32]. Intermediate theoretical values for the
voltage gap, obtained from simulations for different capacitance values, are also
shown in the figure (Fig. 5.10). It can be noticed that achieving narrower voltage
gaps requires lower capacitance values to be introduced into the equivalent circuit
in order to reproduce computationally the corresponding experimentally obtained
voltage gap. This evidence suggests that

[Plot axis labels: Antigen Concentration, Capacitance]

Fig. 5.10 Calibration curve obtained experimentally for three concentrations. The uptake of
antigens modifies the memristive behavior such that -|Vgap| increases with increasing antigen
concentration. For 0 fM concentration of antigens, it is considered that the voltage gap that appears is
created only due to the bio-functionalization with antibodies. The theoretical values for the voltage
gap are results obtained from simulations for different capacitance values (elaborated by [24])

increasing the concentration of antigens demands lower values for the capacitance
introduced to the equivalent circuit in order to achieve the same value for the volt-
age gap, with respect to this range of capacitance values. For zero concentration of
antigens (0 fM), only the voltage gap already created by the presence of antibodies
after the bio-functionalization process is considered.
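
The correspondence in Table 5.1 between the capacitance introduced into the equivalent circuit and the simulated voltage gap can be illustrated with a small numerical sketch. It assumes piecewise-linear interpolation between the three simulated points of Table 5.1; the interpolation and the function name `capacitance_for_gap` are illustrative assumptions and are not part of the circuit simulation of [24].

```python
# Simulated points from Table 5.1: (capacitance in nF, voltage gap in V).
SIMULATED = [(15.0, 0.362), (24.0, 0.563), (36.0, 0.844)]


def capacitance_for_gap(v_gap):
    """Estimate the capacitance (nF) that gives a target voltage gap (V).

    Assumption: piecewise-linear interpolation between the tabulated
    points; values outside the tabulated range are rejected.
    """
    for (c_lo, v_lo), (c_hi, v_hi) in zip(SIMULATED, SIMULATED[1:]):
        if v_lo <= v_gap <= v_hi:
            fraction = (v_gap - v_lo) / (v_hi - v_lo)
            return c_lo + fraction * (c_hi - c_lo)
    raise ValueError("voltage gap outside the tabulated range")


# Experimental voltage gaps from Table 5.1, keyed by antigen concentration (fM).
EXPERIMENTAL = {0: 0.84, 5: 0.56, 10: 0.37}

for concentration, gap in EXPERIMENTAL.items():
    print(concentration, "fM ->", round(capacitance_for_gap(gap), 1), "nF")
```

Under this assumption, the estimated capacitance decreases as the antigen concentration increases, in line with the trend described above.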

5.5 Memristive Aptasensors

Aptamers are synthetic,² single-stranded RNA or DNA oligonucleotides 15–60 bases
in length. These nucleic acid ligands have small molecular weights (ranging from 5 to
15 kDa) and are chemically developed to bind with high specificity and selectivity to
a specific target analyte, such as a protein, by undergoing a conformational
change. More specifically, the interaction of aptamers with the target is based on
3D folding patterns. The complex 3D structure of the single-stranded oligonucleotide
is due to intramolecular hybridization, which causes folding into a particular
shape. Aptamers fold into tertiary conformations and bind to their targets through
shape complementarity at the aptamer–target interface [34].
DNA aptamers, along with antibodies, are very suitable candidates for the design of
novel and highly specific biosensors. Moreover, DNA aptamers exhibit many advantages,
such as the possibility of supporting continuous monitoring, enhanced stability,
specificity, and reproducibility. In addition, the well-established synthesis protocol and
chemical modification technology are a key benefit of using aptamers,
leading to rapid, large-scale synthesis and a modification capacity that includes a variety
of functional moieties, low structural variation during chemical synthesis, and
lower production costs. Aptamers can bind to nucleic acids, proteins, small organic
compounds, phospholipids, ion channels, and even whole cells [35, 36].

5.5.1 Memristive Aptasensors for Diagnostics

DNA-aptamer-based memristive biosensors, so-called memristive aptasensors
(biotinylated anti-PSA DNA aptamer solution: 5′-[biotin tag] TTT TTA ATT AAA
GCT CGC CAT CAA ATA GCT TT-3′), were investigated for their analytical performance
in biomarker sensing, with prostate cancer as a case study. Prostate-specific antigen (PSA,
30 kDa kallikrein protein) at different concentrations in the aM to pM range was
then used as a model target diagnostic molecule. Electrical characterization
indicated voltage gap openings after the bio-modification of the device with
DNA aptamers and with increasing antigen concentration (Fig. 5.11a, b). A monotonically
increasing trend of the voltage gap was recorded, reaching saturation at some tens

² Source of original text [3, 33].



[Figure 5.11 panels: (a) Log|Ids| [A] versus Vds [V], curve labels: DNA aptamers, PSA; (b) PSA Dose Response, Vgap [V] versus PSA [M]]

Fig. 5.11 Representative electrical characteristics and PSA dose response of memristive aptasensor:
Indicative electrical characteristics demonstrating the introduction of the voltage gap occurring upon
bio-modification of the surface of the nanodevice (a). Calibration curve related to the average voltage
gap versus dose response (b) (Reproduced with permission from [3]. Copyright 2016 American
Chemical Society)

Table 5.2 State-of-the-art list of reported PSA electrochemical aptasensors to date

Method                          Electrode surface   LOD        References
SWV                             GCE                 pM range   [37]
EIS                             Gold electrodes     30 pM      [38]
DPV                             GCE                 7.6 pM     [39]
EIS                             GCE                 0.15 pM    [40]
EIS                             Gold electrodes     fM range   [41]
EIS (capacitance measurements)  Gold electrodes     30 fM      [42]
DPV                             GCE                 300 aM     [40]
Memristive aptasensor           Si-nanowires        23 aM      [3]

SWV square wave voltammetry, EIS electrochemical impedance spectroscopy, DPV differential
pulse voltammetry, GCE glassy carbon electrode

of pM. This outcome signifies that we are within and actually slightly below the
clinical range (critical level of PSA: 4 ng/mL, ca. 133 pM). This allows working
with highly diluted samples, so that significantly lower volumes of clinical samples
are required from the patient and detection at early stages can be achieved. An
ultralow LOD of 23 aM was achieved thanks to the implementation of the memristive
aptasensors. This LOD is the best reported so far in the literature among electrochemical
biosensors for PSA (Table 5.2).
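
The conversion behind the quoted clinical threshold can be checked with a short calculation. The sketch below assumes the 30 kDa molar mass given above for PSA; the function name `ng_per_ml_to_pm` is illustrative. It reproduces the value of about 133 pM for 4 ng/mL and compares it with the 23 aM LOD.

```python
def ng_per_ml_to_pm(concentration_ng_per_ml, molar_mass_kda):
    """Convert a mass concentration (ng/mL) to a molar concentration (pM)."""
    grams_per_litre = concentration_ng_per_ml * 1e-9 * 1e3  # ng/mL -> g/L
    grams_per_mole = molar_mass_kda * 1e3                    # kDa -> g/mol
    moles_per_litre = grams_per_litre / grams_per_mole
    return moles_per_litre * 1e12                            # M -> pM


# Clinical threshold of PSA (4 ng/mL) with a molar mass of 30 kDa.
clinical_threshold_pm = ng_per_ml_to_pm(4.0, 30.0)
print(round(clinical_threshold_pm, 1))

# Ratio between the clinical threshold and the reported LOD of 23 aM.
lod_pm = 23e-6  # 23 aM expressed in pM
print(round(clinical_threshold_pm / lod_pm))
```

The ratio printed last gives a sense of how far below the clinical threshold the reported LOD lies.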
Furthermore, the nanofabricated structures were exposed to PSA prepared in non-diluted,
full human serum at concentrations below the clinical range, offering
proof of the capability of the sensor to function at extremely low concentrations
of biomarkers and to reproduce the increasing trend that results from the introduction
of increasing negative charge on the surface of the nanodevice.

5.5.2 Memristive Aptasensors for Therapeutics

Having demonstrated the direct and highly efficient response of the nanobiosensor
prototype in accurately following the various steps of the DNA aptamer
binding-regeneration cycle, the memristive properties of the nanosensors are further leveraged
for the label-free, ultrasensitive detection of therapeutic compounds (drugs), bringing
a completely new perspective to label-free monitoring in personalized and
precision medicine. Ultrasensitive drug screening is a key aspect in the field of
therapeutics. As therapeutic compounds are going to be supplied at lower and lower
concentrations, the need for more sensitive detectors is of immense importance.
Therefore, memristive aptasensors, which gave ultrasensitive sensing outputs with cancer
biomarkers, were also implemented for effective ultrasensitive drug screening.
The implementation of DNA aptamers also offers the potential for nanosensor
regeneration, opening the way for continuous monitoring of therapeutic compounds,
a very significant requirement in therapeutics. To better show the performance of
the proposed new biosensors, Tenofovir (TFV), an antiviral drug for HIV treatment,
is considered here as a model drug. The therapeutic concentration range of TFV in
the circulating plasma extends from some nM up to 860 nM. TFV-aptamers (5’-Aptamer-C6
Amino-3’) developed for specific interaction with TFV were immobilized on the
surface of the memristive devices, and the detection was performed for drug
concentrations within and slightly below the clinical range, opening the
possibility of future applications requiring minimal amounts of clinical
samples (Fig. 5.12).

[Figure 5.12 schematic labels: Ids, Vds, NiSi, SiO2, Si]

Fig. 5.12 Schematic representation illustrating the memristive sensor, and SEM micrograph depict-
ing the Si-NW arrays anchored between the NiSi pads, which serve as electrical contacts of the
freestanding memristive nanodevice. The position of the current minima for the forward and the
backward regimes changes after the surface treatment introducing a voltage difference in the semi-
logarithmic current-to-voltage characteristics (Reference [33]-Reproduced by permission of The
Royal Society of Chemistry)

[Figure 5.13 panels: (a) Voltage Gap [V] versus Concentration [pM], starting from Blank; (b) Voltage Gap [V] versus Concentration [nM], starting from Blank]

Fig. 5.13 Analytical performance and effective drug detection through the electrical hysteresis
variations in buffer (a) and in full human serum (b). For the in serum detection, a new drug detection
is performed following regeneration of the memristive aptasensor. The response of the sensor to
the new drug binding fits ideally the calibration curve obtained initially. The exposure of the sensor
directly after, to a nonspecific drug, does not result in any signal difference (Reference [33]-
Reproduced by permission of The Royal Society of Chemistry)

The successive uptakes of negative charge at the nanodevice surface led to an
increasing trend of the voltage gap following the increasing concentration of the
target drug. This increasing trend of the hysteresis parameter follows
the dose increase (Fig. 5.13a) up to the value of 193 ± 51 mV for
100 nM, the highest concentration implemented for the case of buffer solution. In
human serum, a hysteresis modification (69 ± 38 mV) is initially indicated for a
concentration of 100 fM and finally reaches the value of 295 ± 61 mV for 1 µM (Fig. 5.13b).
At the end of the dose-response cycle in human serum, a regeneration step was
performed and an intermediate TFV concentration of 1 nM was applied. The signal
obtained for the hysteresis voltage gap (97 ± 31 mV) was indeed
back at the value foreseen by the previously recorded dose-response curve. These
results clearly demonstrate the efficiency and consistency of the proposed method,
as well as its applicability for continuous monitoring of therapeutic compounds.
Furthermore, a negative control drug, enzalutamide (a widely
used anti-prostate cancer drug), was applied as an additional step. The
voltage gap (89 ± 35 mV) obtained for the negative control showed no significant
hysteresis modification (<9% difference) at the very same drug concentration of
1 nM; thus, no drug binding occurred for the negative control, demonstrating the
very good specificity of the methodology applied. A very low LOD of 3.09 pM
for PBS buffer and of 1.38 nM for human serum was demonstrated, an optimum
performance compared to the state of the art reported so far for drug detection
in general, and for TFV in particular (Table 5.3). These LODs represent a 10 times
better performance for in-buffer drug detection with respect to the literature and
a two times better performance for drug sensing in human serum.

Table 5.3 State-of-the-art list of reported drug detection to date

Target drug  Method      Surface         LOD (nM), buffer/bio-matrix  Linear range (nM)  References
ART          Ampero      GrPnano         0.031/0.035+                 0.13–1             [43]
CLE          Voltam      BZ GNPs         –/43.96•                     100–800            [44]
CAM          Ampero      CdS-NPs/GNPs    0.14/–                       0.15–2.94          [45]
Gleevec      Conduc      Si-NW           –/–                          Up to 100          [46]
6-MP         Voltam      GRA/Ppy/MWCNTs  –/80*                        200–100000         [47]
PCM          Voltam      MWCNTs          2.9/–                        5–1000             [48]
PCN          Voltam      B-diam          –/320**                      400–100000         [49]
TAM          Voltam      Enz/PN/Pt       0.2/–                        27–297             [50]
TFV          BSI         Glass chip      2.5/–                        Up to 20           [51]
TFV          LC-MS       –               –/680++                      1360–350000        [52]
TFV          Voltam      HMDE            450/870                      Up to 17000        [53]
TFV          LC-UV       –               –/10.4                       35–3480            [54]
TFV          LC-MS       –               –/7                          35–3480            [54]
TFV          Memristive  NW arrays       0.0031/1.38                  0.001–1000         [33]

ART artesunate, CLE clenbuterol, CAM chloramphenicol, PCM paracetamol, PCN penicillin, TAM
tamoxifen, 6-MP 6-mercaptopurine, Voltam voltammetry, Conduc conductance, Ampero amperometry,
BSI backscattering interferometry, LC liquid chromatography, MS mass spectrometry, UV
ultraviolet, GrPnano Gr-polyaniline nanocomposite, BZ GNPs benzedithiol-gold nanoparticles,
GRA/Ppy graphite/polypyrrole, B-diam boron-doped diamond, Enz/PN enzyme/polyaniline
*Diluted human urine sample; **Human urine sample; • Diluted rat urine sample; + Highly diluted
human serum; ++ Human urine; Diluted/precipitated human plasma; Full human serum

5.5.3 Aptamer Regeneration Through the Perspective of Memristive Phenomena

The regeneration properties of the DNA aptamers were directly reflected in the
memristive aptasensor response, as expressed through a voltage difference appearing
in the semi-logarithmic current-to-voltage characteristics. The whole sensing
cycle, consisting of aptamer immobilization, target-drug binding, aptamer regeneration,
and target-drug rebinding, was for the first time portrayed through the variations of
the electrical hysteresis of the memristive nanostructures. A series of denaturation,
pH shocking, and refolding steps is used for surface regeneration (unbinding the target from
the aptamer and restoring the aptamers to their previous folded and functional form). After
treatment with DNA aptamers, the electrical response of the nanodevices indicated
a voltage difference of 116 ± 34 mV. Following the exposure of the nanodevices to the
negatively charged target drug solution of 100 nM, an increase of the voltage difference
occurs as an effect of the drug binding, reaching the value of 156 ± 24 mV
(increase of 34.5%). However, the most promising outcome undoubtedly lies in the
electrical response after the regeneration of aptamers that follows the drug binding.
It was demonstrated that the voltage difference decreased after the regeneration
process back to the blank level (to the value of 120 ± 15 mV), namely, to the level
that corresponds to the value obtained initially just after the aptamer immobilization
on the surface. In order to verify the reliability of the nanobiosensor in repeated
measurements and its capability for further capturing and detecting the target molecules after

the aptamer regeneration, as well as the biosensor’s response to a different concentration of
the sensing target, the very same nanodevices were exposed anew to the drug solution,
this time at a higher concentration of 1 µM. A higher concentration of
reagent implies a higher negative charge density and, as expected, the uptake of a higher
concentration of the drug resulted in a more pronounced modification of the voltage
gap. In this case, a 100% increase in signal was acquired, and the voltage difference
reached the value of 240 ± 23 mV, twice the signal exhibited by the DNA
aptamers. It is worth mentioning that the signal difference for the
hysteresis modification between the two different drug concentrations is estimated
at around 53.8%. Finally, a further regeneration step was carried out to illustrate the
repeatable regeneration of the nanobiosensor. It is indeed demonstrated that the
voltage difference (121 ± 39 mV) registered after this further regeneration process
coincided with the value achieved initially for the aptamer immobilization on the
surface, which was also the same as the one obtained after the first regeneration. This
means that the biosensor returns very reproducibly to the original value of the voltage
gap after any regeneration process and is, therefore, ready for a new measurement of
the target molecule concentration. The reproducibility of the device, presented as the
percentage differences between the hysteresis modification signals obtained after the
initial aptamer immobilization and after the two regeneration procedures applied, is around
3.4% and 4.3% for these two regenerations, respectively. Those percentages can be
considered negligible, since these differences are approximately five times
smaller than the measurement errors usually obtained in the detection of drug binding.
Overall, it is worth highlighting that all these findings provide clear evidence that
the natural characteristics of DNA aptamers in binding to drugs, and related functions, may
indeed be efficiently transduced by using the memristive phenomena (Fig. 5.14).
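
The percentages quoted for the binding and regeneration cycle can be recomputed from the mean voltage differences given in the text. The sketch below is only an arithmetic check; the function name `percent_change` is illustrative, and it assumes that the 1 µM step is referred to the post-regeneration blank level of 120 mV while the reproducibility figures are referred to the initial 116 mV.

```python
def percent_change(reference_mv, value_mv):
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value_mv - reference_mv) / reference_mv


# Mean voltage differences (mV) reported in the text for each step.
aptamer = 116.0         # after aptamer immobilization
drug_100_nm = 156.0     # after drug binding at 100 nM
regeneration_1 = 120.0  # after the first regeneration
drug_1_um = 240.0       # after drug binding at 1 uM
regeneration_2 = 121.0  # after the second regeneration

print(round(percent_change(aptamer, drug_100_nm), 1))       # binding at 100 nM
print(round(percent_change(regeneration_1, drug_1_um), 1))  # binding at 1 uM
print(round(percent_change(drug_100_nm, drug_1_um), 1))     # 1 uM versus 100 nM
print(round(percent_change(aptamer, regeneration_1), 1))    # first regeneration
print(round(percent_change(aptamer, regeneration_2), 1))    # second regeneration
```

With these reference choices, the computed values agree with the 34.5%, 100%, 53.8%, 3.4% and 4.3% stated above.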

[Figure 5.14 plot: Voltage Gap [V] (0.0 to 0.3) for the successive steps: 2. Aptamer Immobilization, 3. Drug Binding, 4. Aptamer Regeneration, 5. Drug Binding, 6. Aptamer Regeneration]

Fig. 5.14 DNA aptamer immobilization, target molecule binding, and DNA aptamer regenera-
tion cycle, illustrated through the electrical hysteresis variations (Reference [33]-Reproduced by
permission of The Royal Society of Chemistry)

5.6 Conclusions

As we have seen in this chapter, the coupling of the memristive effect with biological
processes gives us innovative nanobiosensing technologies with unprecedented
performance in both diagnostics and therapeutics. Silicon nanowire arrays exhibiting
a memristive effect, combined with sophisticated bio-functionalization, give rise to a new kind of
biological sensor: the memristive biosensor. This completely new class of biosensors
successfully addresses the issue of early detection of cancer, since it provides
high-performance, ultrasensitive, label-free electrochemical sensing of extremely
small traces of cancer biomarkers, such as PSA, as well as effective screening
and continuous monitoring of therapeutic compounds, such as TFV, also in full
undiluted human serum. This powerful capability opens the way to
new solutions in medical practice, especially in the field of personalized and
precision medicine.

References

1. M.-A. Doucey, S. Carrara, Nanowire sensors in cancer. Trends Biotechnol. (2018)


2. S. Carrara, D. Sacchetto, M.-A. Doucey, C. Baj-Rossi, G.D. Micheli, Y. Leblebici, Memristive-
biosensors: a new detection method by using nanofabricated memristors. Sens. Actuators B
Chem. 171–172, 449–457 (2012)
3. I. Tzouvadaki, P. Jolly, X. Lu, S. Ingebrandt, G. de Micheli, P. Estrela, S. Carrara, Label-free
ultrasensitive memristive aptasensor. Nano Lett. 16(7), 4472–4476 (2016)
4. Y.V. Pershin, M. Di Ventra, Memory effects in complex materials and nanoscale systems. Adv.
Phys. 60(4), 145–227 (2011)
5. F. Puppo, M. Di Ventra, G. De Micheli, S. Carrara, Memristive sensors for pH measure in dry
conditions. Surf. Sci. 624, 76–79 (2014)
6. I. Tzouvadaki, N. Madaboosi, I. Taurino, V. Chu, J.P. Conde, G. de Micheli, S. Carrara, Study
on the bio-functionalization of memristive nanowires for optimum memristive biosensors. J.
Mater. Chem. B 4(12), 2153–2162 (2016)
7. T. Mauser, C. Déjugnat, G.B. Sukhorukov, Reversible pH-dependent properties of multilayer
microcapsules made of weak polyelectrolytes. Macromol. Rapid Commun. 25(20), 1781–1785
(2004)
8. H. Riegler, F. Essler, Polyelectrolytes. 2. Intrinsic or extrinsic charge compensation?
Quantitative charge analysis of PAH/PSS multilayers. Langmuir 18(8), 6694–6698 (2002)
9. W. Chen, T.J. McCarthy, Layer-by-layer deposition: a tool for polymer surface modification.
Macromolecules 30(1), 78–86 (1997)
10. F. Puppo, M. Doucey, J. Delaloye, T.S.Y. Moh, G. Pandraud, P.M. Sarro, G.D. Micheli, S.
Carrara, SiNW-FET in-air biosensors for high sensitive and specific detection in breast tumor
extract. IEEE Sens. J. 16(10), 3374–3381 (2016)
11. F. Puppo, A. Dave, M. Doucey, D. Sacchetto, C. Baj-Rossi, Y. Leblebici, G.D. Micheli, S. Car-
rara, Memristive biosensors under varying humidity conditions. IEEE Trans. NanoBioscience
13(1), 19–30 (2014)
12. I. Tzouvadaki, J. Zapatero-Rodriguez, S. Naus, G. de Micheli, R. O’Kennedy, S. Carrara,
Memristive biosensors based on full-size antibodies and antibody fragments. Submitted to
Sens. Actuator B-Chem
13. D.B. Strukov, G.S. Snider, D.R. Stewart, R.S. Williams, The missing memristor found. Nature
453(5), 80–83 (2008)

14. M.D. Ventra, Y.V. Pershin, L.O. Chua, Circuit elements with memory: memristors, memca-
pacitors, and meminductors. Proc. IEEE 97(10), 1717–1724 (2009)
15. A. Gelencsér, T. Prodromakis, C. Toumazou, T. Roska, Biomimetic model of the outer plexiform
layer by incorporating memristive devices. Phys. Rev. E 85(4), 041918 (2012)
16. J.J. Yang, M.D. Pickett, X. Li, D.A. Ohlberg, D.R. Stewart, R.S. Williams, Memristive switching
mechanism for metal/oxide/metal nanodevices. Nat. Nanotechnol. 3(7), 429–433 (2008)
17. S. Shin, K. Kim, S.M. Kang, Compact models for memristors based on charge-flux constitutive
relationships. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 29(4), 590–598 (2010)
18. D. Biolek, M. Di Ventra, Y.V. Pershin, Reliable SPICE simulations of memristors, memcapacitors
and meminductors. Radioengineering 22 (2013)
19. I. Vourkas, A. Batsos, G.C. Sirakoulis, SPICE modeling of nonlinear memristive behavior. Inter.
J. Circuit Theory Appl. 43(5), 553–565 (2015). CTA-13-0128
20. F. Puppo, F.L. Traversa, M.D. Ventra, G.D. Micheli, S. Carrara, Surface trap mediated electronic
transport in biofunctionalized silicon nanowires. Nanotechnology 27(34), 345503 (2016)
21. Z. Biolek, D. Biolek, V. Biolkova, SPICE model of memristor with nonlinear dopant drift.
Radioengineering 18(2), 210–214 (2009)
22. S. Benderli, T.A. Wey, On SPICE macromodelling of TiO2 memristors. Electron. Lett. 45(7),
377–379 (2009)
23. A. Rak, C. Gyorgy, Macromodeling of the memristor in SPICE. Trans. Comp. Aided Des. Integ.
Circuit Sys. 29, 632–636 (2010)
24. I. Tzouvadaki, F. Puppo, M. Doucey, G.D. Micheli, S. Carrara, Computational study on the
electrical behavior of silicon nanowire memristive biosensors. IEEE Sens. J. 15(11), 6208–6217
(2015)
25. S.H. Lee, Y.S. Yu, S.W. Hwang, D. Ahn, A SPICE-compatible new silicon nanowire field-effect
transistors (SNWFETs) model. IEEE Trans. Nanotechnol. 8, 643–649 (2009)
26. H. Elhadidy, J. Sikula, J. Franc, Symmetrical current-voltage characteristic of a
metal-semiconductor-metal structure of Schottky contacts and parameter retrieval of a CdTe structure.
Semicond. Sci. Technol. 27(1), 015006 (2012)
27. S. Lee et al., Equivalent circuit model of semiconductor nanowire diode by SPICE. J. Nanosci.
Nanotechnol. (2007)
28. C.Y. Yim, et al., Electrical properties of the ZnO nanowire transistor and its analysis with
equivalent circuit model. J. Korean Phys. Soc. 48, 1565–1569 (2006)
29. K. Steiner, Capacitance-voltage measurements on Schottky diodes with poor ohmic contacts.
IEEE Trans. Instrum. Meas. 42(1), 39–43 (1993)
30. M. Bleicher, E. Lange, Schottky-barrier capacitance measurements for deep level impurity
determination. Solid State Electron. 16(3), 375–380 (1973)
31. P.S. Ho, E.S. Yang, H.L. Evans, X. Wu, Electronic states at silicide-silicon interfaces. Phys.
Rev. Lett. 56(1), 177–180 (1986)
32. J. Werner, A.F.J. Levi, R.T. Tung, M. Anzlowar, M. Pinto, Origin of the excess capacitance at
intimate Schottky contacts. Phys. Rev. Lett. 60(1), 53–56 (1988)
33. I. Tzouvadaki, N. Aliakbarinodehi, G. de Micheli, S. Carrara, The memristive effect as a novelty
in drug monitoring. Nanoscale 9(27), 9676–9684 (2017)
34. S. Shigdar, J. Lin, Y. Yu, M. Pastuovic, M. Wei, W. Duan, RNA aptamer against a cancer stem
cell marker epithelial cell adhesion molecule. Cancer Sci. 102(5), 991–998 (2011)
35. D.H.J. Bunka, P.G. Stockley, Aptamers come of age -at last. Nat. Rev. Microbiol. 4(8), 588–596
(2006)
36. E. Levy-Nissenbaum, A.F. Radovic-Moreno, A.Z. Wang, R. Langer, O.C. Farokhzad, Nan-
otechnology and aptamers: applications in drug delivery. Trends Biotechnol. 26(8), 442–449
(2008)
37. M. Souada, B. Piro, S. Reisberg, G. Anquetin, V. Noël, M. Pham, Label-free electrochemical
detection of prostate-specific antigen based on nucleic acid aptamer. Biosens. Bioelectron. 68,
49–54 (2015)
38. P. Jolly, N. Formisano, J. Tkáč, P. Kasák, C.G. Frost, P. Estrela, Label-free impedimetric
aptasensor with antifouling surface chemistry: a prostate specific antigen case study. Sens.
Actuators B Chem. 209, 306–312 (2015)

39. B. Liu, L. Lu, E. Hua, S. Jiang, G. Xie, Detection of the human prostate-specific antigen
using an aptasensor with gold nanoparticles encapsulated by graphitized mesoporous carbon.
Microchim. Acta 178(1), 163–170 (2012)
40. B. Kavosi, A. Salimi, R. Hallaj, F. Moradi, Ultrasensitive electrochemical immunosensor for
PSA biomarker detection in prostate cancer cells using gold nanoparticles/PAMAM dendrimer
loaded with enzyme linked aptamer as integrated triple signal amplification strategy. Biosens.
Bioelectron. 74, 915–923 (2015)
41. Z. Yang, B. Kasprzyk-Hordern, S. Goggins, C.G. Frost, P. Estrela, A novel immobilization
strategy for electrochemical detection of cancer biomarkers: DNA-directed immobilization of
aptamer sensors for sensitive detection of prostate specific antigens. Analyst 140(8), 2628–2633
(2015)
42. P. Jolly, V. Tamboli, R.L. Harniman, P. Estrela, C.J. Allender, J.L. Bowen, Aptamer-MIP hybrid
receptor for highly sensitive electrochemical detection of prostate specific antigen. Biosens.
Bioelectron. 75, 188–195 (2016)
43. K. Radhapyari, P. Kotoky, M.R. Das, R. Khan, Graphene-polyaniline nanocomposite based
biosensor for detection of antimalarial drug artesunate in pharmaceutical formulation and bio-
logical fluids. Talanta 111, 47–53 (2013)
44. B. Bo, X. Zhu, P. Miao, D. Pei, B. Jiang, Y. Lou, Y. Shu, G. Li, An electrochemical biosensor
for clenbuterol detection and pharmacokinetics investigation. Talanta 113(9), 36–40 (2013)
45. D.-M. Kim, M.A. Rahman, M.H. Do, C. Ban, Y.-B. Shim, An amperometric chloramphenicol
immunosensor based on cadmium sulfide nanoparticles modified-dendrimer bonded conduct-
ing polymer. Biosens. Bioelectron. 25(3), 1781–1788 (2010)
46. W.U. Wang, C. Chen, K.-H. Lin, Y. Fang, C.M. Lieber, Label-free detection of small-molecule-
protein interactions by using nanowire nanosensors. Proc. Natl. Acad. Sci. USA 102(3), 3208–
3212 (2005)
47. H. Karimi-Maleh, F. Tahernejad-Javazmi, N. Atar, M.L. Yola, V.K. Gupta, A.A. Ensafi, A novel
DNA biosensor based on a pencil graphite electrode modified with polypyrrole/functionalized
multiwalled carbon nanotubes for determination of 6-mercaptopurine anticancer drug. Ind.
Eng. Chem. Res. 54(4), 3634–3639 (2015)
48. R.N. Goyal, V.K. Gupta, S. Chatterjee, Voltammetric biosensors for the determination of parac-
etamol at carbon nanotube modified pyrolytic graphite electrode. Sens. Actuators B Chem.
149(8), 252–258 (2010)
49. Ľ. Švorc, J. Sochr, P. Tomčík, M. Rievaj, D. Bustin, Simultaneous determination of paraceta-
mol and penicillin V by square-wave voltammetry at a bare boron-doped diamond electrode.
Electrochim. Acta 68(4), 227–234 (2012)
50. K. Radhapyari, P. Kotoky, R. Khan, Detection of anticancer drug tamoxifen using biosensor
based on polyaniline probe modified with horseradish peroxidase. Mater. Sci. Eng. C Mater.
Biol. Appl. 33, 583–587 (2013)
51. M.N. Kammer, I.R. Olmsted, A.K. Kussrow, M.J. Morris, G.W. Jackson, D.J. Bornhop, Charac-
terizing aptamer small molecule interactions with backscattering interferometry. Analyst 139,
5879–5884 (2014)
52. M. Simiele, C. Carcieri, A. De Nicolò, A. Ariaudo, M. Sciandra, A. Calcagno, S. Bonora, G. Di
Perri, A. D’Avolio, A LC-MS method to quantify tenofovir urinary concentrations in treated
patients. J. Pharm. Biomed. Anal. 114(10), 8–11 (2015)
53. R. Jain, R. Sharma, Cathodic adsorptive stripping voltammetric detection and quantification
of the antiretroviral drug tenofovir in human plasma and a tablet formulation. J. Electrochem.
Soc. 160(8), H489–H493 (2013)
54. M.E. Barkil, M.-C. Gagnieu, J. Guitton, Relevance of a combined UV and single mass
spectrometry detection for the determination of tenofovir in human plasma by HPLC in therapeutic
drug monitoring. J. Chromatogr. B 854(7), 192–197 (2007)
Chapter 6
Optimized Programming
for STT-MTJ-Based TCAM
for Low-Energy Approximate Computing

Ashwani Kumar and Manan Suri

Abstract With the advent of data-driven systems and processes, high-speed and
energy-efficient computing techniques are highly desirable. Such systems and techniques
are already being employed in many applications that depend mainly on a
huge amount of data, such as information analysis, transmission, policy, and decision-making.
Electronic systems used in these applications are required to perform operations
such as data capture, storage, visualization, and analysis. Most such systems
employ content addressable memories (CAMs), also known as associative memories,
for high-speed data search/compare and compute operations. In this chapter, an
optimized programming scheme for magnetic tunnel junction (MTJ) based resistive
ternary content addressable memory (ReTCAM) for approximate computing (AC)
applications is presented. Basic key concepts related to MTJ structure, physics, electrical
behavior, bit-cell design, and AC are also discussed. The error-tolerant behavior
of AC and the stochastic writing of the ReTCAM cell are exploited to achieve low write
energy. A case study of a 3-bit (LSB) write operation using the proposed programming
scheme is also investigated based on distance match accuracy. The ReTCAM bit-cell is
designed using a perpendicular magnetic anisotropy (PMA) MTJ device with 32 nm
diameter and 90 nm CMOS technology.

6.1 Introduction

Data and network-centric applications involve operations like pattern matching, clas-
sification, similarity search, and online learning [1–3]. Such computations require
massive parallel data processing (associative computing) to achieve high-speed and
efficient computational capability [4].

A. Kumar · M. Suri (B)


Department of Electrical Engineering, Indian Institute of Technology Delhi,
New Delhi 110016, India
e-mail: [email protected]
A. Kumar
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_6

Associative memories, specifically ternary content addressable memory (TCAM)
[5] based architectures, have been proposed to handle such high-level computations [6, 7].
TCAM-based architectures are considered highly efficient for handling the computation
in applications like similarity search [8], pattern matching [9], classification
[10], online learning [11], etc. Recently, researchers have been trying to minimize
the area as well as boost the performance of TCAM architectures to handle the
ever-increasing data requirements in the aforementioned application domains. The
implementation of hybrid resistive TCAMs (ReTCAM) using magnetic tunnel
junction (MTJ) based resistive nonvolatile memory devices is very promising and
shows significant area reduction and performance improvements. The integration
of MTJ devices with CMOS logic also provides inherent nonvolatile storage,
and hence negligible standby power. Furthermore, MTJ devices exhibit inherent
switching stochasticity and variability. The impact of the MTJ’s stochasticity and
variability on hybrid TCAM write operations needs to be investigated, as it has
not been explored much in the literature. A possible reason is that, conventionally,
TCAMs were used for high-speed search/match operations [5], which generally
involve read operations. However, modern computing systems utilize services
like software-defined networking, memory-based computing models, architectures,
etc. Therefore, regular reconfiguration and updating of the TCAM banks [7] would
be required, and this would primarily involve the TCAM write operation. In such cases,
the write operation in TCAM will also play a significant role in the overall system
performance.
In our recent work [12], the impact of MTJ stochastic switching and variability
on 4T-2MTJ based ReTCAM has been studied. The MTJ’s stochastic switching has been
exploited for writing the least significant bits (LSBs) within a ReTCAM entry. This
chapter summarizes the aforementioned technique and presents a study of
the relationship between the MTJ’s switching stochasticity, variability, and ReTCAM
energy consumption for different error tolerances (ET). Furthermore, it presents the
ET levels using specific LSB write patterns that also contain the don’t care “X”
state. The MTJ’s average switching time and switching probability skewness, obtained using the
compact model [13, 14], are also analyzed and presented.
This chapter is organized as follows: Sect. 6.1 is dedicated to TCAM
basic operation, cell structures, and its use in different computing applications.
Section 6.2 presents the basics of MTJ/spin transfer torque (STT) based devices,
stochastic programming, and the MTJ compact model used for this study. Section 6.3
presents the operation of the 4T-2MTJ ReTCAM cell and its variability analysis. Section
6.4 illustrates the proposed probabilistic write scheme for low energy consumption
and results for AC application. Section 6.5 discusses the performance improvement
using the ReTCAM cell and Sect. 6.6 lists the conclusions.
6 Optimized Programming for STT-MTJ-Based TCAM for Low-Energy … 161

Fig. 6.1 Illustrating TCAM cell entries and search operation against a user/search data

6.1.1 TCAM Basics, Cell Structures and Usage for Computing

6.1.1.1 TCAM Basics

TCAM is an electronic memory that is primarily used for high-speed search or
matching operations. A single TCAM cell can be programmed to one of three logic
values, “0”, “1”, or “X” (don’t care), unlike a content addressable memory
(CAM) cell, which stores only “0” and “1”. The additional “X” state provides
more flexibility in search operations and is useful in applications such as
packet matching and forwarding at Internet routers and high-speed IP lookup table search
[15, 16]. A TCAM entry (a row of TCAM cells) contains fixed-size data. User
data is the input to the TCAM in a search operation and is compared against the data
stored in the TCAM entries. Based on the match result, the TCAM provides the location
of the complete data stored in another memory, as shown in Fig. 6.1. Sometimes the TCAM
also provides the complete data or other information associated with the searched
data; hence CAMs/TCAMs are also called associative memories.
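
The ternary matching rule described above can be sketched in software as follows. This is only a behavioral illustration: the function names and the example entries are hypothetical, and the loop stands in for what the hardware does on all rows in parallel.

```python
def entry_matches(entry: str, key: str) -> bool:
    """Return True if every stored symbol equals the key bit or is 'X' (don't care)."""
    if len(entry) != len(key):
        raise ValueError("entry and key must have the same width")
    return all(s == "X" or s == k for s, k in zip(entry, key))


def tcam_search(entries: list[str], key: str) -> list[int]:
    """Return the indices (locations) of all entries that match the search key.

    A real TCAM compares every row at once; this loop only models the result.
    """
    return [i for i, entry in enumerate(entries) if entry_matches(entry, key)]


# Hypothetical 4-bit entries, written as strings over {'0', '1', 'X'}.
entries = ["1010", "10XX", "0X1X", "1111"]
print(tcam_search(entries, "1011"))  # only "10XX" matches this key
```

The returned index plays the role of the location that the TCAM hands to the other memory in Fig. 6.1.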

6.1.1.2 TCAM Core Cell and Circuit Level Implementation

There are several existing designs and architectures, including both pure CMOS-based
and emerging nonvolatile memory (NVM) based TCAMs. A few primary
design parameters matter for TCAM hardware implementation [7]:
1. Energy: Due to massively parallel operation, TCAMs tend to consume a large amount
of energy. Hence, energy-efficient designs are highly desirable.
2. Area: TCAMs are used in constrained environments for very specific purposes. The
need for a parallel architecture leads to a large footprint that mainly depends on
the TCAM cell area, so an area-efficient architecture is essential.
3. Speed: The main reason to employ a TCAM is its speed of operation. A fast
TCAM is always preferable.
4. Noise margin: A TCAM cell has to store three different logic values, unlike a CAM
cell, which stores two. A sufficient noise margin among all the states
of the TCAM is required to avoid logic errors during read/search operations.
162 A. Kumar and M. Suri

Fig. 6.2 Different implementations of ReTCAM bit cells using MTJ devices: a 9T-2MTJ cell having
nine transistors and two MTJ devices; used to design a dual rail TCAM for soft error tolerance [22],
b 6T-2MTJ cell is used to design segmented match line (ML) to reduce the dynamic search power
[23], and c 4T-2MTJ cell with comparatively lower footprint with high-speed operation [24]

Pure CMOS-based implementations of TCAM cells use SRAM-based
architectures ranging from 16T [5] to 12T [17], or DRAM-based cells with a 6T-2C configuration
[18]. SRAM and DRAM storage cells suffer from multiple issues, such as higher
power dissipation, leakage current, low density, and volatility. Furthermore, continuous
transistor scaling has led to higher leakage power and lower reliability [19].
The higher energy consumption of CMOS-based TCAMs limits their usage to network
applications, such as classification [20] and online learning [11]. Recently, several
highly optimized MTJ-based nonvolatile TCAM designs have been proposed [21–24]
to address these issues. Figure 6.2 shows a few architectures of
nonvolatile TCAM cells using MTJ devices proposed in the literature so far. These
TCAM cells have their own pros and cons in terms of the aforementioned key design
parameters. The result of a search operation is decided based on the voltage or current
value at the match line (ML) after some delay from the start of the operation. The detailed
operation of ReTCAM cells is explained in Sect. 6.3.

6.2 Overview of MTJ Device and Compact Model

6.2.1 Overview of MTJ Device

An MTJ device stack consists of several material layers. However, three layers (one oxide
layer and two magnetic layers) play the key role. A thin oxide layer is inserted

Fig. 6.3 MTJ nanopillar stack with thin films of CoFeB/MgO/CoFeB. MTJ resistive switching due to
STT effect from P → AP and AP → P. IMTJ is the current through MTJ; IC_PAP is the critical current
to switch to antiparallel state; IC_APP is the critical current to switch to parallel state

between the two magnetic layers. One of the two magnetic layers is called the free
layer and the other is called the fixed/pinned layer. Figure 6.3 illustrates the MTJ
device stack. The functionality of the spin transfer torque (STT) based perpendicular
magnetization MTJ (PMTJ) is enabled mainly by two phenomena:
1. The STT effect for writing.
2. The tunnel magnetoresistance (TMR) effect for reading.
The magnetic orientation of the free layer can be switched reversibly between
parallel (P) and antiparallel (AP) with respect to the magnetic orientation of the fixed
layer by using the STT mechanism [25]. The magnetization of the free layer
is switched by passing a current of appropriate amplitude and
polarity through the stack (Fig. 6.3). Depending on the P or AP magnetization of the free
layer with reference to the fixed layer, the MTJ structure exhibits two different electrical
resistances, referred to as RP (low resistance state) and RAP (high resistance state),
respectively, as shown in Figs. 6.3 and 6.4a. The sensing/reading performance of an MTJ
depends on the TMR ratio. A high TMR ratio helps in accurate detection of the MTJ
resistance states both in memory and logic circuits. The TMR ratio is defined by (6.1) [13].

\[
\text{TMR ratio} = \frac{R_{AP} - R_{P}}{R_{P}} \tag{6.1}
\]

The TMR ratio directly determines the resistance window between the P and
AP states, as shown in Fig. 6.4a. MTJ devices show an inherent switching probability
that depends on the applied switching conditions, also referred to as switching stochasticity
[14]. One such case is illustrated in Fig. 6.4b, where the MTJ switches during the first
cycle but does not switch in the second cycle even though the same voltage pulse
(Vpulse) is applied. This switching stochasticity is referred to as intrinsic switching
variability, and it affects the cycle-to-cycle switching of the MTJ device.
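
As a small worked example of (6.1), the sketch below computes the TMR ratio from the two resistances and inverts it to obtain RAP for a given RP. The 5 kΩ value for RP is a hypothetical number chosen only for illustration; the 120% TMR is the value used in this chapter.

```python
def tmr_ratio(r_ap: float, r_p: float) -> float:
    """TMR ratio as in (6.1), returned as a fraction (1.2 means 120%)."""
    return (r_ap - r_p) / r_p


def r_ap_from_tmr(r_p: float, tmr: float) -> float:
    """Invert (6.1): R_AP = R_P * (1 + TMR)."""
    return r_p * (1.0 + tmr)


r_p = 5e3                            # hypothetical low-resistance value in ohms
r_ap = r_ap_from_tmr(r_p, tmr=1.2)   # TMR = 120%
print(f"R_AP = {r_ap:.0f} ohm, resistance window = {r_ap - r_p:.0f} ohm")
```

With a TMR of 120%, RAP is 2.2 times RP, so the resistance window equals 1.2 RP regardless of the absolute value assumed for RP.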

6.2.2 STT Based MTJ Device Compact Model

For the analysis presented in this chapter, a Verilog-A compact model [14] of the
STT-PMTJ is used for all circuit simulations. This model includes both the intrinsic and
the extrinsic variability of the MTJ device. Intrinsic variability affects cycle-to-cycle
switching, whereas extrinsic variability affects device-to-device switching of
the MTJ devices. Switching conditions and statistics are studied numerically and
condensed in an analytical model, as described in [14]. The switching time distribution is
described by a gamma distribution, whose mean and skewness are obtained as a
function of the ratio of the programming voltage to the critical voltage (Vc), as
shown in Fig. 6.4c.
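
To make the role of the mean and skewness concrete, the following minimal sketch draws switching times from a gamma distribution parameterized by those two quantities and estimates the switching probability for a given pulse duration. It is not the Verilog-A model of [14]; the mean of 5 ns and skewness of 0.5 are hypothetical placeholders, and the mapping from Vpulse/Vc to these values is deliberately left out.

```python
import random


def gamma_params(mean: float, skewness: float) -> tuple[float, float]:
    """Shape k and scale theta of a gamma distribution with given mean and skewness.

    For a gamma distribution, skewness = 2 / sqrt(k) and mean = k * theta.
    """
    k = 4.0 / skewness ** 2
    theta = mean / k
    return k, theta


def switching_probability(pulse_ns: float, mean_ns: float, skewness: float,
                          trials: int = 1000, seed: int = 0) -> float:
    """Fraction of trials in which the sampled switching time fits inside the pulse."""
    rng = random.Random(seed)
    k, theta = gamma_params(mean_ns, skewness)
    switched = sum(rng.gammavariate(k, theta) <= pulse_ns for _ in range(trials))
    return switched / trials


# Hypothetical operating point: mean switching time 5 ns, skewness 0.5.
for pulse in (3.0, 5.0, 7.0, 9.0):
    print(pulse, switching_probability(pulse, mean_ns=5.0, skewness=0.5))
```

Because the same seed is reused, longer pulses can only increase the estimated probability, which mirrors the cumulative behavior of the switching probability versus pulse duration.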

Fig. 6.4 a MTJ resistance as a function of voltage, showing stochastic switching phenomenon, from
Monte Carlo simulation, b Illustration of the stochastic programming under application of a voltage
pulse, c Mean switching time and skewness of the gamma probability distribution function for
switching, as a function of the programming voltage Vpulse and its ratio to critical voltage
(Vpulse = 2Vc). Inset shows switching PDF for the case Vpulse = 2Vc

Table 6.1 Magnetic tunnel junction parameters
Parameter     Parameter description                  Value
d             MTJ cross section diameter             32 nm
tfl           Free layer thickness                   1.3 nm
RA product    Resistance area product/coefficient    5
TMR           Tunnel magnetoresistance ratio         120%
Vh            Voltage of half TMR                    0.5 V
ΔE/KB T       Relative energy barrier                60

Fig. 6.5 Switching time distribution of single MTJ device over 1000 cycles at a fixed voltage (0.8 V)

For voltages significantly higher than the critical voltage Vc, the switching
probability distribution versus pulse duration resembles a Gaussian
distribution, with the average switching time inversely proportional to Vpulse. As the
voltage is decreased, the average switching time tends to increase exponentially.
Furthermore, the skewness also increases, making the switching probability
distribution function more asymmetric. The inset of Fig. 6.4c illustrates the switching
distribution of the MTJ at a switching voltage of Vpulse = 2Vc. Table 6.1 lists some
of the MTJ device’s physical parameters that are used for the circuit-level simulations.
The TMR ratio of 120% (Table 6.1) has been reported at room temperature
for the same PMTJ stack with CoFeB/MgO/CoFeB layers [26].
MTJ devices exhibit device-to-device switching variation due to process variation
(differences in device geometry and other physical parameters) [13, 27]. Hence,
the programming pulse needed to obtain a particular switching probability (say 50%)
differs from one MTJ device to another. The model helps to predict the mean switching
time of the devices under intrinsic and extrinsic variations. Monte Carlo
simulation-based analysis over the MTJ’s intrinsic and extrinsic variations
captures the complete behavior of MTJ devices [28]. The switching time distribution
for switching an MTJ device from the AP → P state over 1000 cycles is shown in
Fig. 6.5.
The mean switching time of a single MTJ is extracted from this distribution. Furthermore,
the mean switching times of 1000 MTJ devices (considering spatial device-to-device
variability), obtained using Monte Carlo simulations [28], are shown in Fig. 6.6.

Fig. 6.6 Illustrating the different mean switching times of 1000 MTJ devices. The y-axis represents
the number of MTJ devices corresponding to each mean switching time on the x-axis

6.3 Working Principle of 4T-2MTJ ReTCAM Cell

This section explains the ReTCAM cell schematic and its operation, and summarizes the
effect of MTJ variability on the TCAM cell operation.

6.3.1 4T-2MTJ ReTCAM Cell

The ReTCAM cell is designed to store three different logic states, i.e., “0”, “1”,
and “X” (don’t care). The area-efficient nonvolatile hybrid 4T-2MTJ ReTCAM cell
structure used in this chapter is inspired by [24] and is shown in Fig. 6.7.
Searching and writing are the two basic operations of the ReTCAM cell. Table 6.2
lists all logic states of a ReTCAM cell and the corresponding resistance states of
the MTJ devices. The 4T-2MTJ ReTCAM cell has two MTJ devices (MTJ1 and
MTJ2), two nMOS comparison transistors (NM1 and NM2), a write control pMOS
transistor (PM1), and a match line driver nMOS transistor (NM3, connected in diode
configuration). The gate and drain of NM3 are connected to the ML. Data signals as well

Fig. 6.7 Nonvolatile 4T-2MTJ ReTCAM cell schematic illustrating the multiplexing of the signals
for write (blue) and read operation (red). Also, it illustrates the current directions for write
and read operation

Table 6.2 ReTCAM Logic state mapping to MTJ’s resistance states
ReTCAM cell stored logic    MTJ1 orientation/Resistance state    MTJ2 orientation/Resistance state
Logic “0”                   P (0)/Low                            AP (1)/High
Logic “1”                   AP (1)/High                          P (0)/Low
Don’t-Care (“X”)            AP (1)/High                          AP (1)/High
Forbidden                   P (0)/Low                            P (0)/Low

Table 6.3 ReTCAM cell truth table for search and write operations
Operation           ML     BL     BLb    PM1    NM3    NM1    NM2
Search “0”          High   –      –      OFF    ON     OFF    ON
Search “1”          High   –      –      OFF    ON     ON     OFF
Write “0” Step-1    Low    Low    High   ON     OFF    ON     OFF
Write “0” Step-2    Low    High   Low    ON     OFF    OFF    ON
Write “1” Step-1    Low    High   Low    ON     OFF    ON     OFF
Write “1” Step-2    Low    Low    High   ON     OFF    OFF    ON
Write “X”           Low    High   Low    ON     OFF    ON     ON

as control signals for both search and write operations are multiplexed on the same
lines so that only one set of signals is required at a time (Fig. 6.7).
Search Operation: During a search operation, every ReTCAM cell compares the
input bit on its SL and SLb lines with the bit stored in the cell (the stored states are
listed in Table 6.2). The ML is always precharged to a fixed potential before a search
operation begins. The transistor states for searching “0” and “1” in the ReTCAM cell
are listed in Table 6.3. For search “1”, SL
and SLb are at high and low voltages, respectively, and vice versa for search
“0”. Either NM1 or NM2 creates a discharge path for the ML, and the
resistances of MTJ1 and MTJ2 then determine the amount of discharge within a fixed time
frame.
The ML potential is then compared with a threshold value. If the ML potential is higher
than the threshold, the result is a match; otherwise, the result is a mismatch. The ML
discharge current through the cell can be optimized by adjusting (1) the initial precharge
of the ML, (2) the size of NM3, and (3) the gate control voltages and sizes of the access
transistors (NM1 and NM2). The SL and SLb voltage levels can be tuned according to the desired
noise margin and operation speed. In this study, the ML is precharged to 1.2 V, and we
observed ∼10 µA and ∼15.5 µA as match and mismatch currents, respectively, with

a noise margin of ∼312 mV at a search latency of 140 ps. If a cell is in the don’t care
state “X”, both MTJ devices are in their high resistance state, and the result of
any search operation on the cell is always a match.
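
The search behavior described above can be summarized with a purely logical sketch built from Tables 6.2 and 6.3. It ignores all analog effects (discharge currents, threshold, latency) and assumes that NM1 sits in the discharge path of MTJ1 and NM2 in that of MTJ2, which is an assumption made here for illustration; the function names are hypothetical.

```python
HIGH, LOW = "AP", "P"  # high- and low-resistance MTJ states

# Table 6.2: stored logic state -> (MTJ1 state, MTJ2 state)
STATE_TO_MTJ = {"0": (LOW, HIGH), "1": (HIGH, LOW), "X": (HIGH, HIGH)}


def cell_matches(stored: str, search_bit: str) -> bool:
    """Logical outcome of a single-cell search.

    Table 6.3: search "1" turns NM1 on, search "0" turns NM2 on.
    Assumption: NM1 selects MTJ1 and NM2 selects MTJ2.
    A high-resistance path discharges the ML only weakly, so the ML stays
    above the threshold and the cell reports a match.
    """
    mtj1, mtj2 = STATE_TO_MTJ[stored]
    selected = mtj1 if search_bit == "1" else mtj2
    return selected == HIGH


def word_matches(stored_word: str, key: str) -> bool:
    """An entry matches only if none of its cells reports a mismatch."""
    return all(cell_matches(s, k) for s, k in zip(stored_word, key))


print(word_matches("10X", "101"), word_matches("10X", "111"))
```

In this model a cell storing “X” always matches because both of its MTJs are in the high-resistance state, as stated in the text.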
Write Operation: The write signals are multiplexed with the search signals on
the same lines, as shown in Fig. 6.7. As discussed above,
the search operation is a single-step process. Writing, however, is a two-step process
for write “0” and write “1”, whereas it is a single-step operation for write “X”, as
presented in Table 6.3. During a write operation, the WEN signal enables PM1 and the WL
signals are applied to the access transistors (NM1 and NM2). BL and BLb are always
precharged to opposite voltage potentials and hence determine the direction of the current
flowing through the MTJ devices. Depending on the direction of the current, the MTJ devices
are switched from the AP → P state or vice versa, as explained in Sect. 6.2.
WL1 and WL2 determine the order in which the MTJ devices are programmed in the
ReTCAM cell. The write energy and latency are found to be the same for
writing a cell from state “0” to “1” and vice versa.
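
The write steps of Table 6.3 can be replayed with a small deterministic sketch (no stochastic switching). Two points are assumptions inferred from Tables 6.2 and 6.3 rather than stated in the text: NM1 gives access to MTJ1 and NM2 to MTJ2, and BL high with BLb low drives the accessed MTJ to AP while the opposite polarity drives it to P.

```python
# Each write step is (BL, BLb, NM1, NM2), copied from Table 6.3.
# ML is low, PM1 is on, and NM3 is off in every write step.
WRITE_STEPS = {
    "0": [("Low", "High", "ON", "OFF"), ("High", "Low", "OFF", "ON")],
    "1": [("High", "Low", "ON", "OFF"), ("Low", "High", "OFF", "ON")],
    "X": [("High", "Low", "ON", "ON")],
}


def apply_write(value: str, mtj_state: tuple[str, str]) -> tuple[str, str]:
    """Return the (MTJ1, MTJ2) states after an ideal write of `value`.

    Assumed polarity: BL high / BLb low sets the accessed MTJ to "AP",
    BL low / BLb high sets it to "P".
    """
    mtj = list(mtj_state)
    for bl, _blb, nm1, nm2 in WRITE_STEPS[value]:
        target = "AP" if bl == "High" else "P"
        if nm1 == "ON":
            mtj[0] = target
        if nm2 == "ON":
            mtj[1] = target
    return tuple(mtj)


# Starting from stored "0" = (P, AP), write "1" and then "X".
print(apply_write("1", ("P", "AP")), apply_write("X", ("P", "AP")))
```

Under these assumptions the final states reproduce the mapping of Table 6.2, with write “X” finishing in one step because both access transistors are on at once.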

6.3.2 Variability Analysis of ReTCAM

Variability of MTJ devices can affect the performance of the ReTCAM in different
ways. Mainly, two kinds of variability have been observed in MTJ devices, i.e.,
extrinsic variability (device to device) and intrinsic variability (cycle to cycle). The
impact of each on ReTCAM operation is summarized below.

6.3.2.1 Impact of MTJ Extrinsic (Device to Device) Variability on ReTCAM Search Operation

Extrinsic variability owing to the MTJ’s physical and process parameters (listed
in Table 6.4) can degrade the search noise margin (NM) of the ReTCAM cell. Hence, a
standard deviation of 5% around the mean values of the physical parameters (i.e., diameter
of the MTJ nanopillar, TMR value, and resistance area (RA) product), as shown in
Table 6.4, is included in the compact model. With this variability introduced in
the MTJ devices, Fig. 6.8a, b shows the ML match and ML mismatch voltage spreads
and the mismatch window, respectively. In our simulation study, a worst-case
NM degradation of 42% was observed over 1000 ReTCAM cells. However, the
worst-case NM obtained is 182 mV, which is still practical for sensing purposes. Moreover,
in practical cases where fewer ReTCAM cells share a single ML (e.g.,
32, 64, or 128), the NM degradation would not be as high.
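
The reported degradation can be checked directly from the best-case and worst-case noise margins quoted in this chapter (312 mV and 182 mV). The helper name below is hypothetical.

```python
def nm_degradation_percent(nm_best_mv: float, nm_worst_mv: float) -> float:
    """Relative noise margin loss, in percent of the best-case margin."""
    return 100.0 * (nm_best_mv - nm_worst_mv) / nm_best_mv


# 312 mV (no variability) versus 182 mV (worst case over 1000 cells).
print(round(nm_degradation_percent(312.0, 182.0), 1))  # 41.7, i.e. roughly 42%
```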

Table 6.4 MTJ’s physical parameters
Parameter     Mean value    Deviation (%)
Diameter      32 nm         5
TMR           120%          5
RA product    5             5

Fig. 6.8 a ML match and mismatch voltage windows due to device-to-device variation (1000 devices),
b NM best case (312 mV) to worst case (182 mV) variation

6.3.2.2 Impact of MTJ Intrinsic (Cycle to Cycle) Variability on TCAM Write Operation

Intrinsic variability in the MTJ leads to stochastic/probabilistic switching. In the
compact model used here, stochastic switching is modeled using the gamma distribution
and a random switching time distribution. It is observed that, for a particular
switching success probability, the programming pulse voltage and latency are
strongly coupled, as shown in Fig. 6.9.
MTJ stochastic switching translates into probabilistic writes in the ReTCAM
cell. The ReTCAM cell write probability can be written as the product of the
individual MTJ write probabilities when both MTJ devices have to be
written to program the cell into a particular logic state (6.2).

\[
P_{\text{CELL}} = P_{\text{MTJ1}} \cdot P_{\text{MTJ2}} \tag{6.2}
\]

If only one MTJ has to be written to program the cell, then the write probability
of the cell is equal to the switching probability of a single MTJ. We carried out
simulations in which the ReTCAM cell was written from Logic “0” → “1”, Logic

Fig. 6.9 Single MTJ device switching probability (stochasticity) over 1000 cycles with respect to
duration of programming pulse at a fixed voltage amplitude

Fig. 6.10 Proposed probabilistic write scheme for LSBs within each ReTCAM entry

“0” → “X”, Logic “1” → “X”, and their respective reverse cases, giving six
different write combinations. These write simulations were performed for 1000
cycles to study the impact of MTJ stochastic switching on the ReTCAM cell. This
helps to derive a write condition for any targeted cell-level error tolerance (ET) while
optimizing the cell write energy and latency. As the targeted ET value for an application
increases, the cell energy consumption can be further reduced by operating the
ReTCAM cells in a more probabilistic manner.
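
A short sketch of how (6.2) connects to a targeted ET is given below. It assumes that the cell-level ET is the allowed probability of an unsuccessful cell write (ET = 1 − PCELL) and that the MTJs switch independently with identical probability; both are simplifying assumptions made for this illustration, and the function names are hypothetical.

```python
from typing import Optional


def cell_write_probability(p_mtj1: float, p_mtj2: Optional[float] = None) -> float:
    """Cell write probability: (6.2) if both MTJs must switch, else the single-MTJ value."""
    return p_mtj1 if p_mtj2 is None else p_mtj1 * p_mtj2


def required_mtj_probability(et: float, n_mtj: int) -> float:
    """Per-MTJ switching probability needed so that 1 - P_CELL does not exceed `et`.

    Assumes `n_mtj` independent MTJs with identical switching probability.
    """
    return (1.0 - et) ** (1.0 / n_mtj)


for et in (0.001, 0.003, 0.03):  # the three ET levels studied in this chapter
    p_two = required_mtj_probability(et, n_mtj=2)
    p_one = required_mtj_probability(et, n_mtj=1)
    print(f"ET={et:.1%}: two-MTJ write needs p={p_two:.5f}, one-MTJ write needs p={p_one:.5f}")
```

Under these assumptions, a write that switches only one MTJ can use a slightly lower per-device switching probability than a write that must switch both devices for the same cell-level ET.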

6.4 Proposed Low Energy Write Scheme

Several applications (e.g., similarity search and pattern matching) have scope for
error/noise tolerance (ET) [11, 29]. For such computational paradigms, we propose
an approximate computing (AC), low-energy write scheme that exploits the ET.
In the proposed scheme, a few LSBs within each ReTCAM entry are written with
probabilistic writes (as illustrated in Fig. 6.10). Such probabilistic writing of the LSBs
lowers the overall energy consumption.
For a multi-bit hybrid ReTCAM, we present a case in which three LSBs are written
in a probabilistic manner. The number of LSBs with probabilistic programming may
change depending on the actual use case. The LSBs are written for different cell-level
targeted ETs (i.e., 0.1, 0.3, and 3%) using probabilistic write conditions, whereas the rest
of the bits are considered to be written with deterministic write conditions. The
targeted ET of 0.1% is taken as the minimum/reference ET for the comparative study

Fig. 6.11 ReTCAM cell average write energy per bit dependence on cell write probability, voltage
(BL/BLb) and latency. Dashed curves represent energy consumption, whereas, solid curves represent
probability

of write energy and latency. Results obtained using probabilistic “0” → “1” (or vice
versa), and “0” (or “1”) → “X” LSB writes are summarized below.

6.4.1 ReTCAM Probabilistic LSB Write “1”

In this case, we write each LSB cell from Logic “0” to Logic “1” for 1000 cycles.
The impact of MTJ stochastic switching on ReTCAM cell performance parameters
with write voltage scaling (WVS) (1.2 to 0.6 V) is illustrated in Fig. 6.11. The average
write energy per bit and latency for a targeted cell-level ET of 0.1% are shown
in Table 6.5. Figure 6.11 and Table 6.5 clearly show that the energy consumption decreases
at the cost of latency with WVS. To investigate further, we simulated the cell for
different values of ET (making the MTJ switching more probabilistic). We achieved
the best trade-off between write energy and latency at a cell write voltage of 1 V.
The voltage drop (i.e., Vpulse) across the MTJ devices in the cell was ≈ 2Vc at a 1 V write
voltage. From Fig. 6.4c, an MTJ programming voltage of around 2Vc is the minimum
voltage at which the skewness curve is almost flat. The switching distribution curve (inset
of Fig. 6.4c) is nearly symmetric at 2Vc. The performance improvement per LSB
for different ETs is given in Table 6.6. Higher ET values (0.3 and 3%) lead to
a further improvement in energy and latency.

6.4.2 ReTCAM Probabilistic LSB Write “X”

Write “X” is a single-step process because only one MTJ needs to be switched into the
high resistance state, irrespective of the previously stored logic state (“0” or “1”) in the
cell. Thus, Write “X” is a relatively fast operation. Writing Logic “X” in the cell also

Table 6.5 Performance parameters for Write “0”→“1”/“1”→“0”
Cell write voltage (V)    Targeted ET (%)    Avg. write energy (fJ)    Write latency (ns)
1.2                       0.1                866.6                     7.64
1.0                       0.1                782.0                     9.02
0.8                       0.1                672.2                     12.4
0.6                       0.1                580.0                     17.7

Table 6.6 Performance improvement (maximum) for Write “0”→“1”/“1”→“0”
Cell write     Targeted ET    Avg. write     Energy improvement    Write latency    Latency improvement
voltage (V)    (%)            energy (fJ)    factor                (ns)             factor
1.0            0.3            735.3          1.06                  8.35             1.08
1.0            3              639.7          1.22                  7.14             1.26

Fig. 6.12 ReTCAM cell average write energy dependence on cell writing probability, write voltage
(BL/BLb), and write latency. Dashed curves represent energy, whereas, solid curves represent
probability

increases the number of matching sequences irrespective of the search bit because, for
such cells, the search result is always a match. Thus, some LSBs of ReTCAM entries can be
written with “X” (using probabilistic writes) without significantly affecting the
correctness of the search operation. The impact of probabilistic writes in the ReTCAM with
voltage scaling from 1.2 to 0.6 V is shown in Fig. 6.12. The average write energy per bit,
the write latency, and their improvement compared to Write “1” are shown in Table 6.7 (at a
fixed ET of 0.1%). Writing “X” at 1 V was found to perform better than
Write “1” for an identical ET. Similarly, Table 6.8 lists the average
write energy per bit, the latency, and their improvement factors compared to Write “1”
for the higher ET values (0.3 and 3%).

Table 6.7 Performance improvement for Write “0”→“X”/“1”→“X” compared to “0”→“1”/“1”→“0”
Cell write     Targeted ET    Average write    Energy improvement    Write latency    Latency improvement
voltage (V)    (%)            energy (fJ)      factor                (ns)             factor
1.2            0.1            474.1            1.8                   4.73             1.6
1.0            0.1            410.7            1.9                   5.08             1.73
0.8            0.1            387.2            1.7                   7.57             1.63
0.6            0.1            324.4            1.78                  9.81             1.8

Table 6.8 Performance improvement for Write “0”→“X”/“1”→“X” compared to “0”→“1”/“1”→“0”
Cell write     Targeted ET    Write energy    Energy improvement    Latency (ns)    Latency improvement
voltage (V)    (%)            (fJ)            factor                                factor
1.0            0.3            369.8           1.98                  4.49            1.86
1.0            3              305.8           2.09                  3.75            1.90

6.5 Discussion

Write voltage scaling (WVS) was found to reduce energy, but at the cost of increased
latency, irrespective of the logic state (“0”, “1”, or “X”) written in the LSB at a fixed
ET. However, LSBs can be written with probabilistic writes for AC applications, thus
optimizing the write energy-latency performance. Moreover, by writing Logic
“X” using probabilistic writes in the LSBs, we achieved even higher cell performance
compared with writing Logic “0”→“1”/“1”→“0”, as shown in Table
6.8. Figure 6.13 shows the overall improvement in average write energy per bit
and latency, by factors of 2.83x and 1.99x, respectively. These improvements are
achieved using a probabilistic write of “X” in the LSBs (for ET = 3%) at a write voltage
of 1 V, compared to Write “1” (for ET = 0.1%) at 1.2 V (equal to VDD). At the same time,
it also improves the search and write accuracy in terms of distance error (DE). The
DE is defined as the difference (in decimal value) between the targeted write number
and the number actually written using the probabilistic write. We observed
a trade-off between write energy per bit and DE.
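
The energy improvement factors quoted above follow from simple ratios of the tabulated energies. The sketch below recomputes them from the average write energies in Tables 6.5 to 6.8; the dictionary layout and function name are hypothetical, and only the energy figures are covered.

```python
# Average write energy per bit in fJ, keyed by (cell write voltage in V, targeted ET in %).
# Values copied from Tables 6.5 and 6.6 (Write "1") and Tables 6.7 and 6.8 (Write "X").
WRITE_1 = {(1.2, 0.1): 866.6, (1.0, 0.1): 782.0, (1.0, 0.3): 735.3, (1.0, 3.0): 639.7}
WRITE_X = {(1.2, 0.1): 474.1, (1.0, 0.1): 410.7, (1.0, 0.3): 369.8, (1.0, 3.0): 305.8}


def improvement(reference: float, improved: float) -> float:
    """Improvement factor: how many times smaller `improved` is than `reference`."""
    return reference / improved


# Write "X" versus Write "1" at the same voltage and ET (Tables 6.7 and 6.8).
for key in WRITE_1:
    print(key, round(improvement(WRITE_1[key], WRITE_X[key]), 2))

# Overall case: Write "1" at 1.2 V, ET = 0.1% versus Write "X" at 1.0 V, ET = 3%.
overall = improvement(WRITE_1[(1.2, 0.1)], WRITE_X[(1.0, 3.0)])
print(f"overall energy improvement: {overall:.2f}x")
```

The last ratio reproduces the 2.83x energy improvement reported for the two extreme write cases.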
Tables 6.9 and 6.10 show the DE obtained when the probabilistic write scheme (for an ET
of 3%) is used for the 3 LSBs while writing Logic “1” and “X”,
respectively. The worst-case DE of 0.21 is observed when the 3 LSBs are written from
Logic state “000” to “111”. The worst-case DE with Write “X” (0.105) is half of the
worst-case DE observed for Write “1”. Thus, Write “X” gives a better trade-off
between DE and write energy per bit than Write “1”. This DE, i.e., the inaccuracy
relative to the targeted number, takes advantage of the ET available in AC applications.
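
The DE values for Write “1” can be reproduced analytically under a simple assumption: starting from “000”, every targeted “1” bit is written independently and succeeds with probability 1 − ET = 0.97. This is an expectation-based sketch, not the 1000-cycle simulation used in this chapter, and the function names are hypothetical.

```python
def expected_written_value(target_bits: str, p_write: float) -> float:
    """Expected decimal value after a probabilistic write starting from all zeros.

    Each targeted '1' bit switches independently with probability `p_write`;
    targeted '0' bits stay at 0.
    """
    n = len(target_bits)
    return sum(p_write * 2 ** (n - 1 - i)
               for i, bit in enumerate(target_bits) if bit == "1")


def distance_error(target_bits: str, p_write: float) -> float:
    """Targeted decimal value minus the expected written value."""
    return int(target_bits, 2) - expected_written_value(target_bits, p_write)


for target in ("001", "100", "111"):
    print(target, round(expected_written_value(target, 0.97), 2),
          round(distance_error(target, 0.97), 2))
```

With these assumptions the expected written value is 0.97 times the target, so the DE grows linearly with the targeted number and reaches 0.21 for “111”.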

Fig. 6.13 Overall write energy and latency improvement per LSB using probabilistic write for two
extreme write cases

Table 6.9 Distance error (DE) due to probabilistic Write “0”→“1” of 3 LSBs for ET = 3% (1000
write cycles)
Targeted 3-LSB write          Average written number    Distance error (DE) from
operation (Binary/Decimal)    (Initial state: 000)      targeted number
001/1                         0.970                     0.03
010/2                         1.94                      0.06
011/3                         2.91                      0.09
100/4                         3.88                      0.12
101/5                         4.85                      0.15
110/6                         5.82                      0.18
111/7                         6.79                      0.21

Table 6.10 Distance error (DE) due to probabilistic Write “0”→“X” of 3 LSBs for ET = 3% (1000
write cycles)
Initial state of 3 LSB (Binary)    Targeted 3-LSB write    Distance error from targeted write
000                                00X                     0.015
000                                0X0                     0.03
000                                0XX                     0.045
000                                X00                     0.06
000                                X0X                     0.075
000                                XX0                     0.09
000                                XXX                     0.105

6.6 Conclusion

In this chapter, a fast and energy-efficient write scheme for hybrid ReTCAMs is
presented, which can be used in AC paradigms. The presented scheme exploits
the MTJ’s intrinsic stochastic switching, coupled with the merit of writing the logic
don’t-care state, for different levels of ET (i.e., 0.1, 0.3, and 3%). The device-level
link between the optimum write voltage and the overall TCAM performance
was discussed in detail. Relating the average switching time and the skewness of the
switching probability of the MTJ device to the ReTCAM cell ensured optimum programming.
The average write energy and write latency per bit improved by
factors of 2.83x and 1.99x, respectively, while keeping a 3% ET. The worst-case write
inaccuracy for probabilistic writes of LSBs is also reduced by a factor of 2 by using the
don’t-care-based scheme. In an AC paradigm using an n-bit ReTCAM, the LSBs
can be written with stochastic write conditions to significantly improve the write
performance without a significant drop in match accuracy.

References

1. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A.H. Byers, Big
Data: the next frontier for innovation, competition, and productivity (2011). Available via
DIALOG. https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
2. C.P. Chen, C.Y. Zhang, Data-intensive applications, challenges, techniques and technologies:
a survey on Big Data. Inf. Sci. 275, 314–347 (2014)
3. C. Perera, A. Zaslavsky, P. Christen, D. Georgakopoulos, Context aware computing for the
internet of things: a survey. IEEE Commun. Surv. Tutor. 16(1), 414–454 (2014)
4. J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, C. Asthagiri, ASC: an associative-
computing paradigm. IEEE Comput. 27(11), 19–25 (1994)
5. K. Pagiamtzis, A. Sheikholeslami, Content-addressable memory (CAM) circuits and architec-
tures: a tutorial and survey. IEEE J. Solid-State Circuits 41(3), 712–727 (2006)
6. Q. Guo, X. Guo, Y. Bai, R. Patel, E. Ipek, E.G. Friedman, Resistive ternary content addressable
memory systems for data-intensive computing. IEEE Micro. 35(5), 62–71 (September–October
2015)
7. R. Karam, R. Puri, S. Ghosh, S. Bhunia, Emerging trends in design and applications of memory-
based computing and content-addressable memories. Proc. IEEE 103(8), 1311–1330 (2015)
8. A. Bremler-Barr, Y. Harchol, D. Hay, Y. Hel-Or, Ultra-fast similarity search using ternary
content addressable memory, in 11th International Workshop on Data Management on New
Hardware (ACM, 2015)
9. F. Yu, R.H. Katz, T.V. Lakshman, Gigabit rate packet pattern matching using TCAM, in 12th
IEEE International Conference on Network Protocols (2004), pp. 174–183
10. F. Yu, T.V. Lakshman, M.A. Motoyama, R.H. Katz, SSA: a power and memory efficient scheme
to multi-match packet classification, in ACM Symposium on Architecture for Networking and
Communications Systems (ACM, 2005), pp. 105–113
11. M. Imani, Y. Kim, A. Rahimi, T. Rosing, ACAM: Approximate computing based on adaptive
associative memory with online learning, in International Symposium on Low Power Electron-
ics and Design (ISLPED) (August 2016), pp. 162–167

12. A. Kumar, M. Suri, V. Parmar, N. Locatelli, D. Querlioz, An energy-efficient hybrid (CMOS-MTJ)
TCAM using stochastic writes for approximate computing, in Non-Volatile Memory
Technology Symposium (NVMTS) (2016), pp. 1–5
13. W.S. Zhao, Y. Zhang, T. Devolder, J.O. Klein, D. Ravelosona, C. Chappert, P. Mazoyer, Failure
and reliability analysis of STT-MRAM. Microelectronics Reliability (Elsevier, 2012)
14. A. Vincent, N. Locatelli, J.-O. Klein, W. Zhao, S. Galdin-Retailleau, D. Querlioz, Analytical
macrospin modeling of the stochastic switching time of spin-transfer torque devices. IEEE
Trans. Electron Devices 62(1), 164–170 (2015)
15. E. Spitznagel, D. Taylor, J. Turner, Packet classification using extended TCAMs, in Proceedings
of 11th IEEE International Conference on Network Protocols, 2003, Atlanta, GA, USA (2003),
pp. 120–131
16. P. Gupta, N. McKeown, Algorithms for packet classification. IEEE Netw. 15(2), 24–32 (March–
April 2001)
17. I. Arsovski, T. Chandler, A. Sheikholeslami, A ternary content addressable memory (TCAM)
based on 4T static storage and including a current-race sensing scheme. IEEE J. Solid-State
Circuits 38(1), 155–158 (2003)
18. H. Noda et al., A cost-efficient high-performance dynamic TCAM with pipelined hierarchical
searching and shift redundancy architecture. IEEE J. Solid-State Circuits 40(1), 245–253 (2005)
19. K. Kim, G. Jeong, Memory technologies for sub-40nm Node, in IEEE International Electron
Devices Meeting (2007), pp. 27–30
20. K. Lakshminarayanan, A. Rangarajan, S. Venkatachary, Algorithms for advanced packet
classification with ternary CAMs, in SIGCOMM Computer Communication Review (2005), pp.
193–204
21. W. Xu, T. Zhang, Y. Chen, Design of spin-torque transfer magnetoresistive RAM and
CAM/TCAM with high sensing and search speed. IEEE Trans. Very Large Scale Integr. (VLSI)
Syst. 18(1), 66–74 (2010)
22. N. Onizawa, S. Matsunaga, T. Hanyu, A compact soft-error tolerant asynchronous TCAM based
on a transistor/magnetic-tunnel-junction hybrid dual-rail word structure, in IEEE International
Symposium on Asynchronous Circuits and Systems (ASYNC) (2014), pp. 1–8
23. S. Matsunaga, A. Katsumata, M. Natsui, S. Fukami, T. Endoh, H. Ohno, T. Hanyu, Fully parallel
6T-2MTJ nonvolatile TCAM with single transistor-based self match-line discharge control, in
Symposium on VLSI Circuits Digest of Technical Papers (2011)
24. S. Matsunaga, S. Miura, H. Honjou, K. Kinoshita, S. Ikeda, T. Endoh, H. Ohno, T. Hanyu, A 3.14
um2 4T-2MTJ-cell fully parallel TCAM based on nonvolatile logic-in-memory architecture,
in Symposium on VLSI Circuits (VLSIC) (2012), pp. 44–45
25. H. Meng, J.-P. Wang, Spin transfer in nanomagnetic devices with perpendicular anisotropy.
Appl. Phys. Lett. 88, 172506 (2006)
26. S. Ikeda et al., A perpendicular-anisotropy CoFeB–MgO magnetic tunnel junction. Nat. Mater.
9(9), 721–724 (2010)
27. W. Zhao, C. Chappert, V. Javerliac, J.P. Noziere, High speed, high stability and low power
sensing amplifier for MTJ/CMOS hybrid logic circuits. IEEE Trans. Magn. 45(10), 3784–
3787 (2009)
28. A. Kumar, S. Sahay, M. Suri, Switching-time dependent PUF using STT-MRAM, in 2018
31st International Conference on VLSI Design and 2018 17th International Conference on
Embedded Systems (VLSID) (2018), pp. 434–438
29. J. Han, M. Orshansky, Approximate computing: an emerging paradigm for energy-efficient
design, in 18th IEEE European Test Symposium (ETS) (2013), pp. 1–6
Chapter 7
Greedy Edge-Wise Training of Resistive
Switch Arrays

Doo Seok Jeong

Abstract A technical challenge of machine learning based on artificial neural networks
is the costly large-scale multiply-accumulate (MAC) operation. The larger
the network, the more MAC operations are required for inference as well as training.
As an alternative to this conventional digital MAC operation, a resistive switching
memory array realizes analog MAC operations in a fully parallel manner. Algorithms
for training such an array are mainly taken from conventional machine learning
algorithms such as backpropagation, customized so that they can be suitably
implemented on the array. In this chapter, we address such customized machine
learning algorithms as well as new algorithms that owe little to conventional
machine learning. A particular focus is placed on a recently proposed greedy
edge-wise training algorithm that is suitable for resistive switching memory arrays.

7.1 Introduction

Resistive switching memory has two distinct resistance states, a high resistance
state (HRS) and a low resistance state (LRS), which represent binary “0” and “1”
states [1, 2]. Ideally, once written, the state is maintained unless a voltage larger
than the switching threshold is applied. One of the great advantages of resistive
switching memory lies in its scalability in comparison with capacitance-based
memory such as dynamic random access memory (RAM). That is, scaling down
resistive memory cells does not significantly complicate the cell design or its
fabrication process, as opposed to the capacitors of dynamic RAM. A crossbar
array (CBA) of such resistive switching memory cells with selectors enables access
to an arbitrary cell, realizing nonvolatile RAM. This nonvolatile RAM is popularly
referred to as resistive RAM (ReRAM for short).

D. S. Jeong (B)
Division of Materials Science and Engineering, Hanyang University, Wangsimni-ro 222,
Seongdong-ku, 04763 Seoul, Republic of Korea
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_7

Three conceivable strategies for ReRAM architecture are (i) a CBA of cells with
active selectors (transistors), (ii) a CBA with passive selectors (e.g., diodes and ovonic
threshold switches [3–5]), and (iii) a CBA without selectors. Hereafter, the first two
are termed active CBA and passive CBA, respectively. A fascinating attribute of
CBAs is the inherent occurrence of an analog multiply-accumulate (MAC) operation
during the simultaneous application of reading voltages to the lines. The operation
is fully parallel such that the minimum complexity of MAC operation, O(1), is ideally
achieved. Moreover, with real-valued conductance of each cell in the array, a matrix
with real-valued entries and its multiplication by a vector can be realized as follows:
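\[
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_N \end{pmatrix}
=
\begin{pmatrix}
g_{11} & g_{12} & \cdots & g_{1M} \\
g_{21} & g_{22} & \cdots & g_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
g_{N1} & g_{N2} & \cdots & g_{NM}
\end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_M \end{pmatrix},
\tag{7.1}
\]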

where $v_m$, $i_n$, and $g_{nm}$ denote the voltage applied to the mth column, the
resulting current measured at the nth row, and the conductance of the cell between the
mth column and nth row, respectively. In this case, $g_{nm}$ takes a real value. Yet, in
the CBA without selectors, the sneak current likely hinders correct operation given the
high line resistance that results from scaling down. The passive CBA also enables analog
MAC operation with the minimum complexity O(1). A remarkable advantage is that the
deployed passive selectors suppress sneak current so that the reliability of MAC
operation is likely improved relative to the selector-free CBA. The selectors add
barely any footprint as long as each selector and memory cell are stacked. Yet, the
use of selectors may not be compatible with memory of real-valued conductance
given the contribution of the selector resistance to the total conductance. Instead, the
passive CBA is most compatible with binary resistive memory. The active CBA offers
an easier solution to the sneak current problem and compatibility with memory of
real-valued conductance. However, a higher operation complexity [>O(1)] and the large
footprint of the transistors undermine the efficiency of MAC operation.
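The relation in (7.1) can be illustrated numerically. The following is a minimal sketch that assumes an ideal array (no sneak current, no line resistance); the function name `crossbar_mac` and the conductance and voltage values are hypothetical and chosen only for illustration.

```python
def crossbar_mac(conductance, voltages):
    """Return row currents i[n] = sum_m g[n][m] * v[m], as in Eq. (7.1).

    conductance: N x M nested list (siemens); voltages: M-long list (volts).
    An ideal array is assumed: no sneak current and no line resistance.
    """
    if any(len(row) != len(voltages) for row in conductance):
        raise ValueError("each row needs one conductance per column voltage")
    return [sum(g * v for g, v in zip(row, voltages)) for row in conductance]


# Hypothetical 2 x 3 array and read voltages (illustrative values only).
g = [[1e-6, 2e-6, 0.0],
     [0.0, 1e-6, 3e-6]]
v = [0.2, 0.1, 0.3]
currents = crossbar_mac(g, v)
print(currents)
```

In software the loop costs N × M multiplications, whereas the array ideally delivers all row currents in a single read step, which is the O(1) behavior referred to above.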
Such MAC operation is the heart of machine learning based on artificial neural
networks. An artificial neural network is a nonlinear hypothesis whose nonlinearity
arises from the activation functions that are referred to as neurons. Other than these
nonlinear activation functions, the whole calculation is linear in that the input into
neuron A is merely the weighted sum of outputs from the neurons in contact with
neuron A. This relation for the simple network in Fig. 7.1 is described by
\[
\begin{pmatrix} z_1^{L} \\ z_2^{L} \\ \vdots \\ z_N^{L} \end{pmatrix}
=
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1M} \\
w_{21} & w_{22} & \cdots & w_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{NM}
\end{pmatrix}
\begin{pmatrix} a_1^{L-1} \\ a_2^{L-1} \\ \vdots \\ a_M^{L-1} \end{pmatrix},
\tag{7.2}
\]

Fig. 7.1 Toy neural network. The circles denote activation functions (neurons)

where $a_m^{L-1}$, $z_n^{L}$, and $w_{nm}$ are the output of neuron m in layer L − 1, the
input to neuron n in layer L, and the connection weight between neuron m and neuron n,
respectively. Note that the bias array is omitted for simplicity. The similarity to (7.1)
is easily noticed, which makes it possible to apply any type of CBA to neural network
calculation, potentially offering energy and time efficiency.
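As a small illustration of (7.2), the sketch below computes the inputs z to one layer as weighted sums of the previous layer's outputs and then applies an activation function. The logistic activation and the numerical values are assumptions made for this example; the text does not prescribe a particular activation.

```python
import math


def layer_inputs(weights, prev_outputs):
    """z[n] = sum_m w[n][m] * a[m], the linear part of Eq. (7.2) (bias omitted)."""
    return [sum(w * a for w, a in zip(row, prev_outputs)) for row in weights]


def logistic(x):
    """Assumed activation function (neuron); any nonlinearity could be used."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical 2 x 3 weight matrix and outputs of layer L - 1.
w = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
a_prev = [1.0, 0.5, 0.0]
z = layer_inputs(w, a_prev)        # linear step: the part a CBA can carry out
a_next = [logistic(x) for x in z]  # nonlinear step: done by the neurons
print(z, a_next)
```

Only the linear step has the form of (7.1); the activation is applied outside the array.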
Section 7.1.1 addresses the general framework of learning in machine learning
based on artificial neural networks. Section 7.1.2 is dedicated to general
strategies for training CBAs for supervised learning. Section 7.2 addresses a recently
proposed greedy edge-wise training method suitable for on-chip learning.

7.1.1 Learning and Network Architecture

By and large, learning can be classified as discriminative or generative.
Learning in both cases means optimization of model (neural network) parameters
such as weights and biases with a training dataset. Discriminative learning aims to
train a network such that the network merely captures the differences among the
input examples that are tagged with their own labels. That is, the model does not learn
the structure of input examples of the same label. For instance, when the model is
trained with handwritten digits from 0 to 9 (10 labels in total), as a consequence
of full training, the model can distinguish examples in one label from those in the
others. Yet, the model cannot find the structure of the examples in a label, which
is what categorizes them under that particular label. Thus, discriminative learning is
suitable for classification, i.e., supervised learning. Irrespective of the learning
framework, learning should be distinguished from inference since inference does not
tweak model parameters, unlike learning. Inference is a process that evaluates the
response of the output neurons to given input data. Thus, task success is evaluated by
inference.
The feed-forward neural network (sketched in Fig. 7.2a) is well suited to
discriminative learning. As the name indicates, during inference, input data flow
unidirectionally to the output layer. The feed-forward neural network varies in
architecture, including the number of layers and the connection configuration. The
simplest architecture may be a single-layer neural network (perceptron) that consists
of merely an input and an output layer. The N neurons in the output layer are fully
connected to the M input neurons. Therefore, this network involves a single N × M weight
matrix. A multi-layer perceptron includes hidden layers between the input and output
layers, which improve classification accuracy by producing nonlinear decision
boundaries [6]. Each additional hidden layer needs one additional weight matrix. For
instance, a feed-forward neural network including L hidden layers involves L + 1
weight matrices, which increases the workload. The convolutional neural network
(CNN) is another type of feed-forward neural network that has sparse and localized
connections as opposed to the perceptron [6, 7]. Such neural networks with hidden
layers are classified as deep neural networks (DNNs).

Fig. 7.2 a Schematic of a feed-forward neural network including hidden layers. b A schematic of
an RBM

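The growth of the workload with depth can be made concrete by counting weight matrices and MAC operations for a fully connected network. This is a minimal sketch; the layer sizes below are hypothetical and the helper names are introduced only for this example.

```python
def weight_shapes(layer_sizes):
    """Shapes (N, M) of the weight matrices between consecutive layers."""
    return [(layer_sizes[k + 1], layer_sizes[k])
            for k in range(len(layer_sizes) - 1)]


def mac_count(layer_sizes):
    """Number of multiply-accumulate operations for one inference pass."""
    return sum(n * m for n, m in weight_shapes(layer_sizes))


# Hypothetical network: input layer, one hidden layer (L = 1), output layer.
sizes = [784, 100, 10]
shapes = weight_shapes(sizes)
print(len(shapes), shapes, mac_count(sizes))
```

With one hidden layer the list holds L + 1 = 2 matrices, in line with the count given above.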
Generative learning can capture the structure of input data, unlike discriminative
learning [8]. However, generative learning by itself cannot endow a model with a
classification function. Generative learning enables input data to be mapped onto a new
data space with different bases from the input space, which contrasts features of one
“implicit” class with the others. The network architecture for generative learning
obviously differs from the feed-forward neural network. The restricted Boltzmann
machine (RBM) is a typical example of a network architecture suitable for generative
learning [9]. An RBM is a probabilistic neural network that consists of visible and hidden
layers as illustrated in Fig. 7.2b. Each neuron in both layers serves as a feature or a
dimension. Akin to the feed-forward neural network, the RBM takes weight and bias
as model parameters. Input data are mapped onto the hidden layer depending upon
the model parameters so that the input data are described by different features
(neurons) in the hidden layer. If the hidden layer includes fewer neurons than the input,
the mapping implies a reduction in dimension, which is referred to as dimensionality
reduction. Once input data are mapped onto the hidden layer through the weight
matrix and bias array, the data can be remapped onto the visible layer to recover
the original input data (autoencoding). An RBM is trained in such a way as to increase
the equivalence between the original input data and the recovered (reconstructed) data.
This also enhances the equivalence between arbitrary input data and their
representation in the hidden layer. The RBM, therefore, needs bidirectional data flow
through the same edges (connections), which obviously differs from the feed-forward
neural network. Though a unit RBM merely consists of two layers (Fig. 7.2b), multiple
unit RBMs can be stacked for repeated changes in dimension through the unit RBMs. Such a
network is referred to as a deep belief network (DBN) [10]. A DBN is trained
in a greedy layer-wise manner in that the first RBM unit from the input visible layer
is first fully trained and the following units are subsequently trained until the last
RBM unit [10].

Fig. 7.3 a Schematic of a feed-forward neural network including hidden layers. b The procedures
of inference and backpropagation (training) for the feed-forward neural network

7.1.2 Backpropagation in Feed-Forward Neural Networks and Strategies for Training CBAs

Backpropagation is a commonly applied algorithm to train a feed-forward neural
network for supervised learning. Given that each input datum has an explicit label,
one can evaluate the difference between the desired (correct) and actual outputs. This
difference is termed error or cost, and this difference as a function of input data is
referred to as a cost function. The goal of training is quantitatively straightforward:
to minimize the cost by tweaking the model parameters. Assume a feed-forward
neural network with L hidden layers. The network involves L + 1 weight
matrices; the (L + 1)th matrix $w_{L+1}$ is for the connection between the Lth hidden
layer and the output layer. As such, the cost for given data can first be evaluated from
the output layer, and consequently, matrix $w_{L+1}$ and bias array $b_{L+1}$ are first
optimized. Subsequently, the cost for the Lth hidden layer (i.e., the difference between
the desired and actual outputs of the Lth hidden layer) is evaluated to modify matrix
$w_L$ and bias array $b_L$. The desired output of the hidden layer is acquired from the
cost of the output layer so that the error propagates from the output to the lower layer.
This parameter update continues down to matrix $w_1$ and bias array $b_1$. That is, the
sequence of parameter updates runs from the output to the input layer, as opposed to
inference. This training process is termed backpropagation. Schematics of
backpropagation and inference are illustrated in Fig. 7.3.
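The order of operations described above can be sketched for a network with a single hidden layer. This is an illustrative example only: it assumes logistic neurons, a squared-error cost, and plain gradient descent with a hypothetical learning rate `lr`, none of which is specified in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, b1, w2, b2):
    """Inference: data flow from the input through the hidden layer to the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(w2, b2)]
    return h, y


def cost(y, target):
    """Assumed squared-error cost."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def backprop_step(x, target, w1, b1, w2, b2, lr=0.5):
    """One in-place update: the error is evaluated at the output, then propagated back."""
    h, y = forward(x, w1, b1, w2, b2)
    # Error at the output layer (derivative of the cost w.r.t. the neuron input).
    delta2 = [(yi - ti) * yi * (1.0 - yi) for yi, ti in zip(y, target)]
    # Error at the hidden layer, obtained from the output error through w2.
    delta1 = [hj * (1.0 - hj) * sum(delta2[k] * w2[k][j] for k in range(len(w2)))
              for j, hj in enumerate(h)]
    for k, row in enumerate(w2):          # output-side parameters first
        for j in range(len(row)):
            row[j] -= lr * delta2[k] * h[j]
        b2[k] -= lr * delta2[k]
    for j, row in enumerate(w1):          # then the parameters nearer the input
        for i in range(len(row)):
            row[i] -= lr * delta1[j] * x[i]
        b1[j] -= lr * delta1[j]


# Hypothetical toy example: 2 inputs, 2 hidden neurons, 1 output.
x, target = [1.0, 0.0], [1.0]
w1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
w2, b2 = [[0.2, -0.1]], [0.0]
cost_before = cost(forward(x, w1, b1, w2, b2)[1], target)
backprop_step(x, target, w1, b1, w2, b2)
cost_after = cost(forward(x, w1, b1, w2, b2)[1], target)
print(cost_before, cost_after)
```

Note that the hidden-layer error is computed from the output error before any weight is changed, so both layers are updated with respect to the same forward pass.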
Backpropagation or a modified (often simplified) version of it is often used to train
a CBA [11–14]. It is often assumed that the CBA represents real-valued conductance.
A simple way (delta rule) is to first evaluate the error from the output layer [11–13].
This error determines the sign of a weight (conductance) change. A more complicated way
is to evaluate the error also from the output layer and accordingly program the
desired conductance in an iterative manner with conductance verification [11, 14].
As such, this algorithm requires error-evaluation and write-evaluation circuits so that
the consequent circuit overhead may outweigh the benefits of the efficient MAC
operation.
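A sign-only form of the delta rule can be sketched as follows. The fixed step size, the function names, and the numbers are assumptions made for illustration; the cited works may differ in how the magnitude of each conductance change is set.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def delta_rule_sign_update(g, inputs, outputs, targets, step):
    """Change each conductance by a fixed step whose sign follows error * input.

    g is an N x M nested list that is modified in place. Only the sign of the
    output error is used; the fixed step is an assumption of this sketch.
    """
    for n, row in enumerate(g):
        err = targets[n] - outputs[n]      # error evaluated at the output layer
        for m in range(len(row)):
            row[m] += step * sign(err * inputs[m])
    return g


# Hypothetical single-output example with normalized conductances.
g = [[0.5, 0.5]]
delta_rule_sign_update(g, inputs=[1.0, 0.0], outputs=[0.2], targets=[1.0], step=0.1)
print(g)
```

Only the cell driven by a nonzero input changes, and it moves in the direction that reduces the output error.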

7.2 Markov Chain Hebbian Learning Algorithm

The Markov chain Hebbian learning (MCHL) algorithm [15] opens the way for training
a CBA for supervised learning without a cost function. Instead of optimizing
the whole model by looking up the error, the MCHL algorithm enables a local learning
rule defined between a pair of neurons to eventually optimize the model parameters
as a whole. Because each edge between a pair of neurons is updated without a global
function for the whole network, the MCHL algorithm is classified as a greedy edge-wise
training algorithm. That is, adjusting each edge is believed to lead the energy of the
entire system to its minimum. Another significant feature of the MCHL algorithm is the
use of ternary weights w[i, j] ∈ {−1, 0, 1} not only for inference but also for training.
This distinguishes the algorithm from binarizing real-valued weights at each update
step [16] as well as from the use of auxiliary real-valued variables [17]. Other
important features are as follows:
(i) each weight is updated in a probabilistic manner,
(ii) given the finite states of weight and probabilistic update among the states, the
update process follows a finite-state Markov chain,
(iii) a group of neurons associatively represent a particular label, which is in line
with a concept cell [18],
(iv) a deep network is trained in a greedy layer-wise fashion.
To date, stochastic update of binary weights has been addressed in the framework
of stochastic Hebbian learning that accounts for the long-term potentiation
(LTP; Hebbian learning) and long-term depression (LTD; anti-Hebbian learning) of
binary synapses [17, 19]. These studies reported successful learning using such
stochastic Hebbian learning, to a degree comparable to that of deterministic synapses
with real-valued weights. Yet, they did not offer a network architecture and method
for supervised learning.

7.2.1 Network Structure and Energy

A unit network for the MCHL algorithm is analogous to an RBM. The unit network
has two layers of binary stochastic neurons without recurrent connections. However,
the main difference is that this unit network is a feed-forward network so that the
hidden layer in the RBM is replaced by the output layer. This output layer does not feed
input into the input layer, unlike the RBM. A schematic of a unit network including
M input features and N output neurons is illustrated in Fig. 7.4a. u1 and u2 denote the
input and output arrays, respectively, which are defined as

\[
\mathbf{u}_1 \in \mathbb{R}^{M},\ 0 \le u_1[i] \le 1; \qquad
\mathbf{u}_2 \in \mathbb{Z}^{N},\ u_2[i] \in \{0, 1\}.
\]

As such, H neurons in the output layer are grouped to associatively represent each
of the L labels so that N is equal to LH. Hereafter, such a group is referred to as a
bucket. When the L labels are indexed from 1 to L, u2[(n − 1)H + 1:nH] is a block
of output activities for the nth label. Note that x[a:b] denotes a block ranging from
the ath to bth elements of vector x.
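The block notation above is 1-based and inclusive, which differs from the 0-based, end-exclusive slices of most programming languages. The short sketch below shows one way to translate between the two; the helper name `bucket_slice` and the example values are hypothetical.

```python
def bucket_slice(label, bucket_size):
    """Python slice for the block x[(n - 1)H + 1 : nH] of the nth label.

    label is 1-based as in the text; the returned slice is 0-based and
    end-exclusive, following Python conventions.
    """
    if label < 1:
        raise ValueError("labels are indexed from 1")
    return slice((label - 1) * bucket_size, label * bucket_size)


# Hypothetical output activities for L = 3 labels and H = 2 neurons per bucket.
u2 = [0, 1, 1, 1, 0, 0]
print(u2[bucket_slice(2, 2)])
```

The same slice selects the rows of w that belong to a label.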
Weight matrix w is, therefore, an N × M matrix that defines the strength in
connectivity between a pair of neurons. As such, each entry of w takes one of the
ternary values, −1, 0, and 1. According to the bucket configuration in the output
layer, the weight matrix is also partitioned such that w[(n − 1)H + 1: nH, 1: M] is
for the connection from the input vector to the output neurons of the nth label.

Fig. 7.4 a Basic network of M input and N output binary stochastic neurons (u1 and u2: their
activity vectors). b Behavior of P(u2[i] = 1) with z[i] when a[i] = 0. c Graphical description of
the weight matrix w (w ∈ Z^{N×M}; w[i, j] ∈ {−1, 0, 1}) that determines the correlation between
the input activity u1 (u1 ∈ R^M; 0 ≤ u1[i] ≤ 1) and output activity u2 (u2 ∈ Z^N; u2[i] ∈ {0, 1}).
This weight matrix w evolves in accordance with given pairs of an input u1 and write vector
v (v ∈ Z^N; v[i] ∈ {−1, 1}), ascertaining the statistical correlation between u1 and v by following
the sub-updates. d Potentiation: a weight component at the current step t (wt[i, j]) has a nonzero
probability to gain +1 (i.e., Δwt[i, j] = 1) only if u1[j] ≠ 0, v[i] = 1, and wt[i, j] ≠ 1; for instance,
given u1 = (0, 1, 0, …, 0) and v = (1, −1, −1, …, −1), wt[1, 2] has a probability of positive update.
e Depression: all components wt[i, 2] (i ≠ 1) are probabilistically subject to negative update (gain
−1) insofar as u1[2] = 1, v[i] = −1, and wt[i, 2] ≠ −1

The energy E of the network is given by

\[
E(\mathbf{u}_1, \mathbf{u}_2) = -\left(2\mathbf{u}_2 - \mathbf{1}\right)^{T} \cdot \mathbf{w} \cdot \mathbf{u}_1 + \mathbf{b}^{T} \cdot \mathbf{u}_2,
\tag{7.3}
\]

where 1 is an N-long vector filled with ones and b denotes a bias vector for the output
layer. The term 2u2 − 1 in (7.3) transforms u2 such that a quiet neuron (u2[i] = 0) is given
an output of −1 rather than zero. This counts the cost of a positive connection
(w[i, j] = 1) between a nonzero input (u1[j] ≠ 0) and an output neuron in an undesired
label (u2[i] = 0) in supervised learning. This undesired connection raises the energy by
u1[j]. The joint probability distribution of u1 and u2 is

\[
P(\mathbf{u}_1, \mathbf{u}_2) = e^{-E(\mathbf{u}_1, \mathbf{u}_2)/\tau} / Z,
\]

where Z is the partition function of the network,

\[
Z = \sum_{j=1}^{M} \sum_{i=1}^{N} e^{-E(u_1[j],\, u_2[i])/\tau},
\]

and τ denotes a temperature parameter. Therefore, the conditional probability of u2 given
u1 is
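\[
\begin{aligned}
P(\mathbf{u}_2 | \mathbf{u}_1)
&= \frac{\exp\left[\sum_{i=1}^{N}\left(-a[i]u_2[i] - \sum_{j=1}^{M} w[i,j]u_1[j] + 2\sum_{j=1}^{M} u_2[i]w[i,j]u_1[j]\right)/\tau\right]}
{\prod_{i=1}^{N}\sum_{u_2[i]\in\{0,1\}} \exp\left[\left(-a[i]u_2[i] - \sum_{j=1}^{M} w[i,j]u_1[j] + 2\sum_{j=1}^{M} u_2[i]w[i,j]u_1[j]\right)/\tau\right]} \\
&= \prod_{i=1}^{N} \frac{\exp\left[\left(-a[i]u_2[i] + 2\sum_{j=1}^{M} u_2[i]w[i,j]u_1[j]\right)/\tau\right]}
{1 + \exp\left[\left(-a[i] + 2\sum_{j=1}^{M} w[i,j]u_1[j]\right)/\tau\right]}.
\end{aligned}
\tag{7.4}
\]

Here a[i] is the bias of the ith output neuron; comparing the exponent with (7.3) shows
that it corresponds to the ith element of b. Due to the lack of recurrent connections,
(7.4) is simplified to

\[
P(\mathbf{u}_2 | \mathbf{u}_1) = \prod_{i=1}^{N} P(u_2[i] \,|\, \mathbf{u}_1).
\]

Therefore, the following equation holds:

\[
P(u_2[i] = 1 \,|\, \mathbf{u}_1) =
\frac{\exp\left[\left(-a[i] + 2\sum_{j=1}^{M} w[i,j]u_1[j]\right)/\tau\right]}
{1 + \exp\left[\left(-a[i] + 2\sum_{j=1}^{M} w[i,j]u_1[j]\right)/\tau\right]}.
\tag{7.5}
\]

Introducing $z[i] = \sum_{j=1}^{M} w[i,j]u_1[j]$ simplifies (7.5) to

\[
P(u_2[i] = 1 \,|\, z[i]) = \frac{e^{(-a[i] + 2z[i])/\tau}}{1 + e^{(-a[i] + 2z[i])/\tau}}
= \frac{1}{1 + e^{(a[i] - 2z[i])/\tau}}.
\tag{7.6}
\]

This probability function for the binary stochastic neuron is plotted in Fig. 7.4b.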
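The energy in (7.3) and the firing probability in (7.6) can be checked against each other numerically. The sketch below is illustrative only; the function names and the one-input, one-output example are assumptions, and the bias is taken to be the same quantity in both equations.

```python
import math


def energy(u1, u2, w, b):
    """E = -(2*u2 - 1)^T w u1 + b^T u2, following Eq. (7.3)."""
    z = [sum(wij * x for wij, x in zip(row, u1)) for row in w]
    coupling = sum((2 * u - 1) * zi for u, zi in zip(u2, z))
    return -coupling + sum(bi * u for bi, u in zip(b, u2))


def firing_probability(z_i, a_i=0.0, tau=1.0):
    """P(u2[i] = 1 | z[i]) = 1 / (1 + exp((a[i] - 2 z[i]) / tau)), Eq. (7.6).

    Large (a[i] - 2 z[i]) / tau can overflow math.exp; this sketch does not guard
    against that.
    """
    return 1.0 / (1.0 + math.exp((a_i - 2.0 * z_i) / tau))


# Hypothetical unit network: one input, one output, positive connection, zero bias.
u1, w, b = [0.5], [[1]], [0.0]
e_quiet = energy(u1, [0], w, b)     # output neuron silent
e_active = energy(u1, [1], w, b)    # output neuron active
# Boltzmann weight of the active state relative to both states (tau = 1).
p_from_energy = math.exp(-e_active) / (math.exp(-e_quiet) + math.exp(-e_active))
p_from_formula = firing_probability(z_i=0.5)
print(e_quiet, e_active, p_from_energy, p_from_formula)
```

For this example a quiet output neuron on a positive connection raises the energy by u1[j] = 0.5, and normalizing the two Boltzmann weights gives the same probability as (7.6).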

7.2.2 Field Application and Update Probability

In the MCHL algorithm, supervision is realized by applying a field that directs an input
pattern to its desired label. Directing the input is implemented by (a) encouraging its
connection with output neurons in the desired label among the L labels and (b)
discouraging it otherwise, both in a probabilistic manner. To this end, a write vector
v that points to the correct label in the L-dimensional space is essential. Each label
is given a bucket of H neurons so that v is an LH-long vector. Given that all labels
are orthogonal to each other, each bucket of v, i.e., v[(n − 1)H + 1:nH] with 1 ≤
n ≤ L, offers a basis of the applied field in the L-dimensional space. v[a:b]
denotes a block ranging from the ath to bth elements. Only one element in each label
bucket of v is randomly chosen for each update and given a nonzero value:
the element in the bucket of the desired label is set to 1 while the other L − 1
elements are set to −1. This write vector v is renewed at every update. Therefore, the
update is sparse. It is noteworthy that v[i] ∈ {−1, 1} when H = 1 and v[i] ∈ {−1, 0, 1}
otherwise.
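The construction of the write vector can be sketched as follows. The sketch assumes that the same position r is chosen in every bucket, as described for the rth element in Sect. 7.2.3; the function name and arguments are hypothetical.

```python
import random


def make_write_vector(label, num_labels, bucket_size, rng=random):
    """Build an LH-long write vector v for a 1-based desired label.

    One position r is drawn at random; the rth element of the desired label's
    bucket is set to 1, the rth element of every other bucket to -1, and all
    remaining elements stay 0.
    """
    if not 1 <= label <= num_labels:
        raise ValueError("label must lie between 1 and num_labels")
    r = rng.randrange(bucket_size)
    v = [0] * (num_labels * bucket_size)
    for n in range(1, num_labels + 1):
        v[(n - 1) * bucket_size + r] = 1 if n == label else -1
    return v


# Hypothetical case: L = 10 labels, H = 5 neurons per bucket, desired label 3.
v = make_write_vector(3, 10, 5, random.Random(0))
print(v)
```

A new vector is drawn for every update, so different neurons in a bucket are addressed over time.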
Figure 7.4c graphically describes the feed-forward connection between u1 and
u2 for the topology in Fig. 7.4a. The matrix w is loaded with ternary elements,
$\mathbf{w} \in \mathbb{Z}^{N \times M}$ with w[i, j] ∈ {−1, 0, 1} and N = LH. Here the input
vector is $\mathbf{u}_1 \in \mathbb{R}^{M}$ with 0 ≤ u1[i] ≤ 1. Consequently,
$\mathbf{v} \in \mathbb{Z}^{N}$ with v[i] ∈ {−1, 0, 1}. According to the
bucket configuration of the write vector v, the matrix w is partitioned such that
w[(n − 1)H + 1:nH, 1:M] defines the correlation between the input and its label (n).
Likewise, z (= wu1) is also partitioned into L buckets in the same order as v, and
the same holds for the output activity vector u2.

Table 7.1 Requirements for the update of nonzero probability

v[i]        1                  −1
u1[j]       0 < u1[j] ≤ 1      0 < u1[j] ≤ 1
wt[i, j]    ≠ 1                ≠ −1
u2[i]       ≠ 1                ≠ 0
P           u1[j]P+^0          u1[j]P−^0

Every pair of u1 and v stochastically updates each component w[i, j] in w by
Δw[i, j] = wt+1[i, j] − wt[i, j] ∈ {−1, 0, 1}. The variables determining Δw[i, j]
include (a) u1[j] and v[i], (b) the current value of wt[i, j], and (c) the output
activity u2[i] as follows (also see Table 7.1).
Condition (a): it is probable that Δw[i, j] = 1 when u1[j]v[i] > 0 (i.e., u1[j] ≠ 0
and v[i] = 1) and Δw[i, j] = −1 when u1[j]v[i] < 0 (i.e., u1[j] ≠ 0 and v[i] = −1),
conditional upon (b) and (c). That is, w[i, j] is updated to connect the nonzero u1[j]
and the ith output neuron in the desired label (when v[i] = 1) and to disconnect them
when v[i] = −1. The former and latter updates are referred to as potentiation and
depression, respectively (Fig. 7.4d, e). This condition is reminiscent of Hebbian
learning in that Δw[i, j] is determined by u1[j]v[i]. The larger the input u1[j], the
more likely the update is successful, such that both P+ (potentiation probability) and
P− (depression probability) scale with u1[j]: $P_+ = u_1[j]P_+^0$ and
$P_- = u_1[j]P_-^0$, where $P_+^0$ and $P_-^0$ denote the maximum probabilities of
potentiation and depression, respectively. Such a negative update is equivalent to
homosynaptic long-term depression in the biological neural network, elucidated by the
Bienenstock–Cooper–Munro theory supporting the spontaneous selectivity development
[20, 21].
Condition (b): The updates Δw[i, j] = 1 and Δw[i, j] = −1 given Condition (a)
are allowed if the current weight is not 1 (wt[i, j] ≠ 1) and not −1 (wt[i, j] ≠ −1),
respectively. This condition keeps w[i, j] ∈ {−1, 0, 1} so that the update falls into a
finite-state Markov chain.
Condition (c): Alongside Conditions (a) and (b), the updates Δw[i, j] = 1 and
Δw[i, j] = −1 require u2[i] = 0 and u2[i] = 1, respectively. That is, a quiet output
neuron (u2[i] = 0) supports Δw[i, j] = 1, whereas an active one (u2[i] = 1) supports
Δw[i, j] = −1.
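Conditions (a) to (c) can be collected into a single update routine. The following is a minimal sketch rather than the reference implementation of [15]; the function name, argument names, and default probabilities are assumptions, and the output activity u2 is passed in rather than sampled.

```python
import random


def mchl_update(w, u1, v, u2, p_plus0=0.1, p_minus0=0.1, rng=random):
    """Apply one stochastic ternary update to w (N x M nested list, in place).

    Potentiation (+1): v[i] == 1,  u1[j] > 0, w[i][j] != 1,  u2[i] == 0,
                       with probability u1[j] * p_plus0.
    Depression  (-1):  v[i] == -1, u1[j] > 0, w[i][j] != -1, u2[i] == 1,
                       with probability u1[j] * p_minus0.
    """
    for i, row in enumerate(w):
        if v[i] == 0:
            continue                      # rows not addressed by v stay invariant
        for j, x in enumerate(u1):
            if x <= 0:
                continue                  # condition (a): nonzero input needed
            if v[i] == 1 and row[j] != 1 and u2[i] == 0:
                if rng.random() < x * p_plus0:
                    row[j] += 1
            elif v[i] == -1 and row[j] != -1 and u2[i] == 1:
                if rng.random() < x * p_minus0:
                    row[j] -= 1
    return w


# Hypothetical 2 x 2 example with the maximum probabilities set to 1 so that
# every allowed update succeeds (useful for checking the conditions only).
w = [[0, 0], [0, 0]]
mchl_update(w, u1=[1.0, 0.0], v=[1, -1], u2=[0, 1], p_plus0=1.0, p_minus0=1.0)
print(w)
```

Each weight moves by at most one state per call and never leaves {−1, 0, 1}, which is what makes the sequence of updates a finite-state Markov chain.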
As a consequence of these update conditions, the MCHL algorithm spontaneously
captures the correlation between input and write vectors (u1 and v) during repeated
Markov processes, which is exemplified in Supplementary Information for randomly
generated input and write vectors that have a statistical correlation.
In general, the learning rate is of significant concern for successful learning; a
proper rate that allows the matrix to converge to the optimized one should be chosen.
The same holds for the MCHL algorithm. The rate in the proposed algorithm is dictated
by $P_+^0$ and $P_-^0$ in place of an explicit rate term. For extreme cases such as
$P_+^0 = 1$ and $P_-^0 = 1$, the matrix barely converges but constantly fluctuates.

7.2.3 Handwritten Digit Recognition (Supervised Online Learning)

The MCHL algorithm can be applied to the handwritten digit recognition task with the
MNIST database (L = 10). Figure 7.5a shows the memory-centric network schematic
for the training, which encompasses one hidden layer (HL). The weight matrices w1 and
w2 were trained in a greedy fashion in that w1 was first fully trained with input
vector u1 and write vector v1, which was then followed by training matrix w2 with
u2 and v2. The output vector a1 of the hidden neurons in response to each MNIST
example was taken as the input to matrix w2. The training protocol was the same for
both matrices. For each training epoch, a chosen input example (28 × 28 matrix) was
converted to a u1 vector of 784 elements: u1 ∈ R^784; 0 ≤ u1[i] ≤ 1. A bucket of H1
elements is assigned to each label in the v1 vector such that v1 is a 10H1-long vector
as illustrated in Fig. 7.5a. Every epoch with an input vector u1 randomly chooses one
of the H1 elements (the rth element) in the bucket of the correct label; the chosen
element in v1 is set to 1, the rth elements in the other buckets (9 in total) to −1, and
the remaining elements (10(H1 − 1) in total) to 0. Therefore, in matrix w1, the elements
in only one row are probabilistically subject to potentiation, those in 9 rows to
depression, and the rest are invariant. That is, the update is sparse. Accordingly,
matrix w1 is partitioned into 10 sub-matrices (see Fig. 7.5a). The sequence of the MCHL
algorithm application is tabulated in Table 7.1.

Fig. 7.5 a Schematic of the network architecture for handwritten digit recognition. A single HL is
included. The matrix w1 first maps the input vector u1 to the hidden neurons. The probability that
u2[i] = 1 for all i’s is taken as an input vector to w2 that maps the input vector to the output neurons.
The write vector v1 has 10 (the number of labels) buckets, each of which has H1 elements, i.e.,
N = 10H1. Each thick arrow indicates an input vector to a group of neurons (each neuron takes
each element in the input vector). b Classification accuracy change in due course of training with
network depth (H1 = 100, H2 = 50, H3 = 30). P+^0, P−^0, and were set to 0.1, 0.1, and 1, respectively

The eventual output of the entire network O is a vector of the outputs of all
labels, O ∈ Z^10; the output of each label O[i] is the activity integrated over the
neurons in label i,

\[
O[i] = \sum_{j=1}^{H_2} a_2[(i - 1)H_2 + j]
\]

(Fig. 7.5a). The location of the maximum coefficient in the output vector designates
the estimated label for a given input. The recognition accuracy was evaluated with
regard to agreement between the desired and estimated labels.
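The read-out step can be sketched in a few lines. The function names and the small example are hypothetical; labels are indexed from 1 as in the text, and a tie is resolved in favor of the lowest label index, which is an arbitrary choice of this sketch.

```python
def label_outputs(a2, num_labels, bucket_size):
    """O[i]: sum of the output activities over the bucket of each label."""
    if len(a2) != num_labels * bucket_size:
        raise ValueError("a2 must hold num_labels * bucket_size activities")
    return [sum(a2[i * bucket_size:(i + 1) * bucket_size])
            for i in range(num_labels)]


def estimate_label(a2, num_labels, bucket_size):
    """Return the 1-based label whose bucket has the largest summed activity."""
    outputs = label_outputs(a2, num_labels, bucket_size)
    return outputs.index(max(outputs)) + 1


# Hypothetical activities for 3 labels with 3 neurons per bucket.
a2 = [1, 0, 0,
      1, 1, 1,
      0, 1, 0]
print(label_outputs(a2, 3, 3), estimate_label(a2, 3, 3))
```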
The weight matrix becomes larger with the bucket size, and so does the memory allocated
for the matrix. Nevertheless, the benefit of deploying buckets at the expense of
memory is twofold. First, a number of features (pixels) are shared among labels
so that an individual feature should not be directed exclusively to a single particular
label. The use of buckets allows such common features to be connected with elements
over different labels given the sparse update on matrix w. For instance, without such
buckets, each attempt to direct the feature at (1, 1), which belongs to both labels 1
and 2, to label 1 probabilistically weakens its connection with label 2; with buckets,
however, the sparse update likely leaves its connection with the other neurons in
label 2 invariant. This feature-sharing characteristic seemingly works against
competition, and thus against selectivity evolution. However, the use of buckets offers
a solution to selectivity evolution, which is the second benefit. As depicted in
Fig. 7.5a, the 10 sub-matrices in matrix w2 define 10 ensembles of H2 output neurons;
the final output from each label O[i] is the sum of outputs over the neurons in the same
label, i.e., the output ranges from 0 to H2. As for the training in Fig. 7.1, a single
training trial is unable to capture the statistical correlation between the input and
write vectors due to a large error; however, the larger the number of trials, the less
likely the statistical error (noise) is incorporated into the data, in line with the
error reduction in Monte Carlo simulation with the number of random samples. The use of
buckets enables the parallel acquisition of samples; therefore, it is conceivable that a
larger bucket size tends to improve recognition accuracy. However, as in Monte Carlo
simulation, the error reduction with sample number becomes negligible when the
number is sufficiently large. Additionally, the memory cost may outweigh the
negligible improvement in accuracy. Therefore, it is of practical importance to
reconcile the performance with the memory cost.
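The diminishing return of a larger bucket can be illustrated with the standard error of a sample mean. The sketch below treats the H neurons of a bucket as independent binary samples that each fire with probability p; this independence is an assumption made for illustration and is not claimed in the text.

```python
import math


def bucket_standard_error(p, bucket_size):
    """Standard error of the mean activity of bucket_size independent binary
    neurons that each fire with probability p (idealized assumption)."""
    return math.sqrt(p * (1.0 - p) / bucket_size)


# Hypothetical firing probability and bucket sizes.
for h in (1, 10, 100, 400):
    print(h, bucket_standard_error(0.5, h))
```

Under this assumption the error falls as the inverse square root of H, so quadrupling the bucket size (and the memory) only halves the error.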
The network depth substantially alters the recognition accuracy as plotted in
Fig. 7.5b. Without the hidden layer, the accuracy merely reaches approximately 88%
at H = 100, while deploying one HL improves the accuracy up to approximately
92% at H1 = 100 and H2 = 50. H1 and H2 denote the number of elements per bucket in
write vector v for w1 and w2, respectively. The accuracy continues to improve
with more HLs (e.g., two HLs; blue curve in Fig. 7.5b), albeit only slightly in contrast
to the improvement by the first hidden layer.

7.3 Conclusion

A CBA of resistive switching memory cells is a potentially time- and energy-efficient
alternative to conventional digital MAC operators, which likely hosts massive neural
networks as hypotheses for extremely complex tasks. When equipped with a CBA-suitable
training algorithm, the CBA-based machine learning technique can move a
large step nearer to practical application. Yet, albeit useful,
real-valued parameter optimization techniques based on cost functions may not be the
ultimate algorithms for CBA conductance update given the circuit and power overhead
due to error estimation. Also, realizing real-valued conductance is technically
difficult. Additionally, the CBA without selectors (suitable for real-valued
conductance) cannot be free from the sneak current issue that supposedly arises
when scaling down. The MCHL algorithm, as an alternative to such backpropagation-based
algorithms, causes few technical difficulties in CBAs given that it deals with
a CBA with ternary weights and that the algorithm is merely a local rule that does
not involve any global function such as a cost function [15]. Yet, the recognition
accuracy is slightly below the results acquired from backpropagation algorithms so
that the primitive MCHL algorithm needs to be modified.

References

1. D.S. Jeong, R. Thomas, R.S. Katiyar, J.F. Scott, H. Kohlstedt, A. Petraru, C.S. Hwang, Rep.
Prog. Phys. 75, 076502 (2012)
2. D.S. Jeong, K.M. Kim, S. Kim, B.J. Choi, C.S. Hwang, Adv Elec Mater 2, 1600090 (2016)
3. S.R. Ovshinsky, Phys. Rev. Lett. 21, 1450–1453 (1968)
4. D.S. Jeong, H. Lim, G.-H. Park, C.S. Hwang, S. Lee, B.-k. Cheong, J. Appl. Phys. 111, 102807
(2012)
5. K. DerChang, S. Tang, I.V. Karpov, R. Dodge, B. Klehn, J.A. Kalb, J. Strand, A. Diaz, N.
Leung, J. Wu, S. Lee, T. Langtry, C. Kuo-wei, C. Papagianni, L. Jinwook, J. Hirst, S. Erra, E.
Flores, N. Righos, H. Castro, G. Spadini, A stackable cross point phase change memory, in IEEE
International Electron Devices Meeting (2009), pp. 1–4
6. Y. LeCun, Y. Bengio, G. Hinton, Nature 521, 436–444 (2015)
7. Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Proc. IEEE 86, 2278–2324 (1998)
8. D. Barber, Bayesian Reasoning and Machine Learning (Cambridge University Press, Cam-
bridge, United Kingdom, 2012)
9. G. E. Hinton, A practical guide to training restricted Boltzmann machines, in Neural Networks:
Tricks of the Trade, ed. by G. Montavon, G. B. Orr, K.-R. Müller (Second edn., Springer Berlin
Heidelberg, Berlin, Heidelberg, 2012), pp 599–619. https://doi.org/10.1007/978-3-642-35289-
8_32
10. G.E. Hinton, S. Osindero, Y.-W. Teh, Neural Comput. 18, 1527–1554 (2006)
11. P. Yao, H. Wu, B. Gao, S.B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.S.P.
Wong, H. Qian, Nat Commun 8, 15199 (2017)
12. M. Prezioso, F. Merrikh-Bayat, B.D. Hoskins, G.C. Adam, K.K. Likharev, D.B. Strukov, Nature
521, 61–64 (2015)
13. F. Alibart, E. Zamanidoost, D.B. Strukov, Nat Commun 4, 2072 (2013)
14. M. Hu, C.E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R.S. Williams,
J.J. Yang, Q. Xia, J.P. Strachan, Adv. Mater. 30, 1705914 (2018)
190 D. S. Jeong

15. G. Kim, V. Kornijcuk, D. Kim, I. Kim, J. Kim, H.C. Woo, J.H. Kim, C.S. Hwang, D.S. Jeong.
arXiv:171108679 [csNE] (2017)
16. M. Courbariaux, Y. Bengio, J.-P. David. arXiv:151100363 2015
17. C. Baldassi, A. Braunstein, N. Brunel, R. Zecchina, Proc. Natl. Acad. Sci. 104, 11079–11084
(2007)
18. R.Q. Quiroga, Nat. Rev. Neurosci. 13, 587–597 (2012)
19. N. Brunel, F. Carusi, S. Fusi, Network: Computation in Neural Systems 9, 123–152 (1997)
20. E. Bienenstock, L. Cooper, P. Munro, J. Neurosci. 2, 32–48 (1982)
21. L.N. Cooper, M.F. Bear, Nat. Rev. Neurosci. 13, 798–810 (2012)
Chapter 8
mMPU—A Real Processing-in-Memory
Architecture to Combat the von
Neumann Bottleneck

Nishil Talati, Rotem Ben-Hur, Nimrod Wald, Ameer Haj-Ali,
John Reuben and Shahar Kvatinsky

Abstract Data transfer between processing and memory units is the main performance and
energy-efficiency bottleneck of modern computing systems, commonly known as the von
Neumann bottleneck. Prior research has attempted to alleviate the problem by moving the
computing units closer to the memory, with limited success since data transfer is still
required. In this chapter, we present the memristive memory processing unit (mMPU),
which relies on a memristive memory to perform computation using the memory cells, and
therefore directly tackles the von Neumann bottleneck. In the mMPU, the operation is
controlled by a modified controller and peripheral circuit without changing the structure
of the memory cells and arrays. As the basic logic element, we present Memristor-Aided
loGIC (MAGIC), a technique to compute logical functions using memristors within the
memory array. We further show how to extend basic MAGIC primitives to execute any
arbitrary Boolean function and demonstrate the microarchitecture of the memory. This
process is required to enable data computing using MAGIC. Finally, we show how to build
the computing system using the mMPU, which performs computation using MAGIC to enable a
real processing-in-memory machine.

N. Talati (✉)
Computer Science and Engineering Department, University of Michigan, Ann Arbor, Michigan
48105, USA
R. Ben-Hur · N. Wald · S. Kvatinsky
Electrical Engineering Department, Technion - Israel Institute of Technology, 3200003 Haifa,
Israel
A. Haj-Ali
Electrical Engineering and Computer Science Department, University of California, Berkeley,
94720 Berkeley, CA, USA
J. Reuben
Lehrstuhl für Informatik 3, Department Informatik (INF), Friedrich-Alexander-University (FAU)
Erlangen-Nürnberg, 91058 Erlangen, Bayern, Germany
© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_8

8.1 Introduction

Contemporary general-purpose computing systems use the von Neumann architecture,
or an improved version of it, which separates the processing units (CPUs)
from the memory system. Due to this separation, data has to travel between the
processor and memory through a bandwidth-limited bus, which incurs a large
performance and energy overhead. This is called the von Neumann bottleneck.
For years, researchers have been attempting to devise possible replacements for this
computation model. Furthermore, with transistor scaling, the performance of both CPUs
and memory has improved, but at very different rates: CPU performance doubles every
2 years, while memory performance doubles every 10 years, as shown in Fig. 8.1b. This
is the reason for today's large performance gap between CPU and memory. As a result,
the processor has to wait for multiple clock cycles to receive data from the memory,
which is known as the memory wall.
Some previous approaches to alleviate the von Neumann bottleneck [11, 12, 33,
36] try to move the processing units (PUs) closer to the memory. While doing so, these
methods use the DRAM technology for the memory system. Although DRAM is a
mature and commercial memory technology, conventional DRAM cells, which are
used to store data, are incapable of processing data, and as a consequence, data must
still be transferred to closely placed PUs. Hence, these approaches only alleviate
the von Neumann bottleneck to a limited extent. An attractive way to completely
solve the von Neumann bottleneck is to give computation capabilities directly to the
memory cells, thereby eliminating the need for transferring data.

Fig. 8.1 a Abstract model of von Neumann architecture, where two separate units (CPU and
memory) are dedicated for data processing and data storage. These elements are connected through
a bandwidth(B/W)-limited bus for data transfer [35]. b Performance scaling of CPU and memory
with respect to time

Emerging memory technologies, such as Resistive RAM (RRAM), Phase-Change
Memory (PCM), and Spin-Transfer Torque Magnetoresistive RAM (STT-RAM),
are considered potential candidates for replacing the conventional memory
technologies, i.e., DRAM and Flash. Unlike conventional memory technologies, which
represent data by the presence or absence of charge, emerging memories store
the logical value as a difference in cell resistance. Hence,
we collectively call them memristors (i.e., memory + resistors) [25]. Apart from
data storage, the variable resistance property can also be exploited to employ the
memristor cells directly for data processing, which has the potential to resolve the
von Neumann bottleneck completely.
A memristor is a two-terminal passive circuit element with variable resistance that
can be controlled by applying voltage across it. The resistance of the memristor is
confined between minimum and maximum resistance values, commonly represented
as a low-resistance state (LRS or RON) and a high-resistance state (HRS or ROFF).
The execution of various logical functions is carried out by assembling memris-
tors with/without other components in different circuit connections and by applying
different voltages across them [13, 23, 24, 26, 28, 30, 34, 43, 46].
In this chapter, we present the memristive Memory Processing Unit (mMPU),
which directly tackles the von Neumann bottleneck by giving the processing capa-
bilities to the memristive memory elements. We first present Memristor-Aided loGIC
(MAGIC), which is a technique to execute logical operations. Specifically, we present
MAGIC NOR, which is a technique to perform computation within the memristive
memory array, by adding a voltage level to the regular memory operation, and with-
out changing the memristive memory crossbar architecture. The inputs and outputs
of the MAGIC gate are the resistance values of the memristors. Hence, it can be used
to process data already stored within the memory without reading the inputs, and
the output is inherently stored at the desired location inside the memory, obviating
the need for a write operation. Furthermore, the MAGIC NOR execution is nonde-
structive in terms of inputs. Hence, logic execution within the memristive memory
enables a true processing-in-memory (PiM) architecture.
We further show how to extend MAGIC execution from a single gate to multiple
gates in parallel, which implements a Single-Instruction Multiple-Data (SIMD)
machine. We describe the microarchitecture of the mMPU that is required to enable
true PiM. Specifically, we show the design of an mMPU controller that receives
regular read/write commands as well as processing commands from the CPU. The write
instruction is executed by applying voltage across the memristors through
wordlines/bitlines, and the read instruction is executed by applying voltage and
measuring the current through the memristor using a sense amplifier. Processing
instructions are broken down by the mMPU controller into a sequence of MAGIC NOR
operations, which are performed using the memristors. We also present SIMPLE MAGIC,
which synthesizes any arbitrary Boolean function into a sequence of MAGIC operations
and can be used within the mMPU controller. Finally, we show the implications of
the system integration of the mMPU in two different modes: (a) mMPU as an accelerator
and (b) mMPU as a processing unit that is also the system memory. Data-intensive
and massively parallel applications, such as deep learning and image processing,
which suffer the most from the von Neumann bottleneck, can be efficiently executed
on the mMPU.

8.2 PiM: Prior Art and Its Impact

Early efforts in investigating PiM date back to the ’90s. Some famous proposals
include a configurable PiM chip that can operate as a conventional memory or as
a Single-Instruction Multiple-Data (SIMD) processor for data processing [12]. The
authors of Active pages [33] have proposed placing the CPU and configurable logic
elements next to the DRAM subarrays to speed up the processing. In Computational
RAM [11], the sense amplifiers of the random access memory are connected directly
to the SIMD pipelines. The Berkeley IRAM project [35, 36] advocated widening the
bandwidth between CPU and memory by designing them on the same die.
Early PiM proposals failed to gain widespread adoption because of four major
challenges [1]. The first challenge was inadequate implementation technology.
Although prior proposals tried to integrate the memory and CPU on the same die,
the incompatible fabrication technologies of DRAM and CPU made it difficult to
incorporate these approaches in practical computing systems. The second was the
lack of a processor architecture that could exploit the high bandwidth enabled by
proximity to memory. Early PiM research required custom architectures, which demanded
considerable design effort and significant advancement in the developer community.
The third challenge was the development of interfaces that allowed PiM computing
units as well as external processing units to access memory. Early efforts required
the design and adoption of custom memory interfaces. The fourth challenge was the
programming model. Early approaches had to develop the programming abstractions from
the bottom up.
Today, these challenges are being overcome through advances in the technologies
and methodologies involved in building computers.
For example, the first challenge has been overcome by the emergence of 3D die
stacking, which enables heterogeneous integration of logic and memory, and of emerging
memory technologies, which facilitate 3D fabrication of memory arrays on top of CMOS
substrates [16]. The evolution of other processing platforms, e.g., GPGPUs and
custom accelerators, has solved the second problem by efficiently utilizing
the high bandwidth offered by the memory within the thermal constraints of the
memory modules [10]. Recent die-stacked memory interface standards (such as High
Bandwidth Memory [22]) and off-chip memory interfaces that expose load-store
semantics (such as Hybrid Memory Cube [21]) meet nearly all the memory interface
requirements of PiM, which surmounts the third challenge. Recent frameworks such
as Heterogeneous System Architecture [2] and the associated software tools for
accelerators have addressed the fourth challenge to widespread adoption of PiM.
Although these technological advances solve most of the aforementioned
problems, current state-of-the-art technologies and future PiM proposals must
address a new set of issues, such as workload heterogeneity (different algorithms
present various memory layouts and access patterns, and involve computations with
different degrees of parallelism and complexity) and fabrication challenges in memory
that can enable PiM.

[Figure 8.2: block diagram; recoverable labels: Input Symbol, Decoder, Memory Array,
Mb (memory bits), STE (State Transition Element) 0 to y, STE Enable, State Bit,
State Transition Clock, Logic, Automata Routing Matrix]

Fig. 8.2 Modern PiM architecture: Micron's Automata Processor (AP) [9], which exploits the
inherent bit-parallelism in DRAM for symbolic pattern matching by performing multiple operations
on a single data item, thereby reducing the number of memory accesses
One current state-of-the-art PiM concept is Micron's Automata Processor (AP)
[9], shown in Fig. 8.2. The AP natively implements the Nondeterministic Finite
Automata (NFA) paradigm in hardware. Thus, the AP is an accelerator designed
specifically for symbolic pattern matching. In this architecture, the decoded input
symbol, rather than a row address, is provided to multiple memory arrays.
Automata operations are invoked through a routing matrix structure exploiting the
inherent bit-parallelism of traditional DRAM, enabling a Multiple-Instruction
Single-Data (MISD) architecture. This architecture provides the flexibility to program
independent automata on a single silicon device [40]. Apart from the AP, several
other recent proposals for PiM enable the transition from DRAM to resistance-based
emerging non-volatile memory technologies (NVRAM). These approaches include
accelerators for artificial neural networks [3, 7], a DDR3-compatible
interface with dual in-line memory modules (DIMMs) capable of performing
content-addressable searches [14], associative computing [15, 45], etc.
All of the previous approaches for addressing the von Neumann bottleneck using
PiM have relied on reducing the distance between the processing units and the
conventional memory system, i.e., DRAM. Although DRAM has been exploited to the best
of its capabilities, these approaches still suffer from a fundamental problem: the need
to transfer data between the CPU and the memory. Because DRAM cells are incapable
of performing logical operations, systems with DRAM as a memory require a
separate resource to perform computation. Emerging memristive technologies, such
as Resistive Random Access Memory (RRAM or ReRAM) [27, 41, 42], enable a
new approach, where the computation of logical functions is done directly using the
memory cells, without any need to instantiate additional CMOS blocks for processing.
In this chapter, the von Neumann bottleneck is solved by giving computational
capabilities directly to the memristive memory cells. Thus, the proposed approach
is fundamentally different from all the previously proposed techniques in PiM and
tackles the data movement issue directly.

8.3 Computation with Memristors

In this section, we first describe the operation of the memristor crossbar array as
memory. Then, we present Memristor-Aided loGIC (MAGIC), a logic family that
enables logical operations to be performed within the memristive memory. We
further show how to integrate the MAGIC circuit within the memristive memory
array without requiring major modifications to the crossbar structure, and present
techniques to perform vector operations using MAGIC.

8.3.1 Memristive Memory

The memristor stores the logical value in terms of its resistance, in contrast to
conventional memories, which use charge to represent data. This resistance is controlled
by applying voltage across the memristor. Memristors can be fabricated between
two metals, which act as the top and the bottom electrodes of a switching dielectric
material. Hence, memristors can be fabricated in the metal layers as part of a standard
CMOS Back End of Line (BEOL) process. Memristive memory generally utilizes
a crossbar structure, which enables an extremely dense memory array with a memory
cell area of 4F^2, where F is the technology feature size. Figure 8.3 shows one
such design of a memristive memory crossbar array. Voltage drivers, row/column
decoders, and sense amplifiers are used as part of the peripheral circuit to support
write and read operations, similar to other memory technologies. To perform a
write operation, a write voltage Vwrite, higher in magnitude than the threshold voltage
(von or voff, which switch the memristor to LRS and HRS, respectively), is applied
across the target memristor through the wordlines and bitlines. For a memristor with
asymmetric switching characteristics (i.e., von ≠ voff), two different write voltages
are applied for writing logic 1 (i.e., VSET) and 0 (i.e., VRESET). Since the write
voltage is applied through wordlines and bitlines, the memristors that share a line
with the target memristor are also partially exposed to this voltage, which may disturb
the state of the unselected memristors; this is known as the write disturb problem
[29]. Half-select voltages (typically Vwrite/2 or Vwrite/3 [6]) are applied to isolate
the nontarget memristors.
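
The effect of half-select biasing can be illustrated with a minimal sketch. It assumes the commonly used V/2 and V/3 line-bias conventions (selected wordline at Vwrite, selected bitline at ground, unselected lines at the intermediate levels noted in the comments); the function name and argument layout are illustrative and do not come from the chapter.

```python
def cell_voltages(rows, cols, target, v_write, scheme="V/2"):
    """Voltage across every cell (wordline minus bitline) during a write.

    Assumed biasing (standard conventions, not taken from the chapter):
      V/2: unselected wordlines and bitlines at v_write / 2
      V/3: unselected wordlines at v_write / 3, unselected bitlines at 2 * v_write / 3
    """
    t_row, t_col = target
    if scheme == "V/2":
        wl_unsel, bl_unsel = v_write / 2, v_write / 2
    elif scheme == "V/3":
        wl_unsel, bl_unsel = v_write / 3, 2 * v_write / 3
    else:
        raise ValueError("scheme must be 'V/2' or 'V/3'")
    # Selected wordline is driven to v_write, selected bitline to ground.
    wl = [v_write if r == t_row else wl_unsel for r in range(rows)]
    bl = [0.0 if c == t_col else bl_unsel for c in range(cols)]
    return [[wl[r] - bl[c] for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    for scheme in ("V/2", "V/3"):
        print(scheme)
        for row in cell_voltages(3, 3, target=(1, 1), v_write=3.0, scheme=scheme):
            print("  ", row)
```

In the V/2 scheme only the cells sharing a line with the target see Vwrite/2 and all other cells see 0 V, whereas in the V/3 scheme every nontarget cell sees a magnitude of Vwrite/3. Either way, the disturbance stays below the switching threshold only if the threshold exceeds the half-select level.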
Fig. 8.3 Crossbar structure of memristive memory array. Voltage controllers and sense amplifiers
are used to perform read, write, and logic operations. Example of a write operation by applying
Vwrite across the target memristors, and a read operation by applying Vread across the memristor and
measuring the current using a sense amplifier. Note that read and write operations are performed
in time-multiplexed fashion

Read operations are performed by applying a voltage Vread, with a magnitude
lower than the threshold voltage for switching, and measuring the current passing
through the device using a sense amplifier (SA), as shown in Fig. 8.3. A primary
challenge for the read operation in memristive memory is the sneak path current
phenomenon [4, 31, 38, 47], which is due to the resistive nature of the memory cells:
the read voltage also creates additional current paths, different from the desired path,
and this additional current flow adds a resistance in parallel to the selected memristor,
which depends on the data stored in the unselected memristors. There are several
ways to overcome this challenge [4, 17, 47], including modification of the memory
cell structure (i.e., using a diode/transistor/selector in series with the memristor)
and using different biasing schemes for the unselected lines (i.e., ground/half-select
biasing schemes).
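
The parallel-resistance effect of a sneak path can be made concrete with a small sketch. It assumes the simplest textbook case, a 2 × 2 crossbar with floating unselected lines, where the only sneak path runs through the three unselected cells in series. The resistance values are illustrative assumptions, not data from the chapter.

```python
def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def read_resistance_2x2(r_selected, r_same_row, r_diagonal, r_same_col):
    """Resistance seen by the sense amplifier in a 2 x 2 crossbar.

    Assumes floating unselected lines, so the sneak path is the series
    connection of the three unselected cells, in parallel with the target.
    """
    sneak_path = r_same_row + r_diagonal + r_same_col
    return parallel(r_selected, sneak_path)


if __name__ == "__main__":
    R_ON, R_OFF = 1e3, 1e6  # assumed example values (ohms)
    # Target cell stores HRS, all three neighbors store LRS (worst case).
    seen = read_resistance_2x2(R_OFF, R_ON, R_ON, R_ON)
    print(f"target R_OFF = {R_OFF:.0f} ohm, sensed = {seen:.0f} ohm")
```

With these assumed values the sensed resistance is close to 3 RON, so a cell in HRS could be misread as LRS, which is why selectors or line biasing schemes are used.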
Although the memristive memory crossbar structure is symmetrical, accessing
memory cells in a conventional memristive memory array is possible only from one
direction. Access from the other direction is blocked since only specific voltages can
be applied in each row/column, and the decoding and sensing circuits are connected to
a single edge of the array. To enable access to memory cells from all sides, voltage
controllers and sense amplifiers can be added on both sides of the memristive memory
crossbar, constituting a memory called transpose memory [39]. The additional peripheral
circuitry provides more flexibility to the memory array and additional
capabilities to the memory system. Figure 8.4a illustrates the difference in peripheral
circuitry between k × m conventional and transpose memory crossbars. Figure 8.4b
compares the ratio of the total area utilized at the CMOS layer to that at the
memristive layer for different array sizes (i.e., k × k). The comparison shows that the
ratio is almost equal for the two memory types (which implies similar area utilization)
for large array sizes (i.e., k ≥ 100).
Note that this is a general comparison irrespective of the memristor technology used,
i.e., without considering the maximum allowed array size.

Fig. 8.4 a Comparison of additional supporting CMOS circuitry to facilitate logic
implementation at nanocrossbar layer for k × m conventional and transpose memories, and b Ratio
between CMOS area (ACMOS) and memristor area (AMEM) for different array sizes (i.e., different k
for k × k arrays) for conventional and transpose memory crossbars. The area utilization at
nanocrossbar layer improves for larger arrays

All operations (read, write, and half-selecting cells) are performed in transpose
memory by application of similar voltages as in conventional memory, with the added
freedom of applying these voltages from both horizontal and vertical directions.
Furthermore, as described later in Sect. 8.3.2, transpose memory offers the additional
feature of transposing the logic execution in the columns of the array, whereas in
conventional memory, this is only possible over a memory row.

8.3.2 MAGIC—Memristor-Aided loGIC

MAGIC is a stateful logic family [37], compatible with computation within the
memristive memory [26]. In MAGIC, n input memristors and a single output memristor
are used to execute n-input Boolean functions (e.g., NOR, NAND, OR, AND, and
NOT). Some MAGIC gates, such as NOR and NOT, can be implemented within the
memristive memory crossbar array without any modification of the crossbar or
the memory cells. An additional voltage level is required, apart from the read and write
voltages, to support MAGIC execution within the memory. Figure 8.5b
shows the schematic of a two-input MAGIC NOR gate, where IN1 and IN2 are the
inputs of the NOR gate, and OUT is the output. The input memristors and the output
memristor are always connected in opposite polarity, as shown in Fig. 8.5b.
Fig. 8.5 a Desired switching characteristic of a memristor, b schematic of a two-input MAGIC
NOR gate, and c a MAGIC NOR gate within a memristive memory array. IN1 and IN2 are the input
memristors and OUT is the output memristor. A single voltage V0 is applied to perform the NOR
operation [26]

To execute the MAGIC NOR operation, the output memristor is initialized to
RON. A voltage V0, higher than the threshold voltage for switching, is applied to the
input memristors, and the output memristor is grounded from the other terminal, as
shown in Fig. 8.5c. Due to the resistive nature of memristors, the voltage is divided
between the input and output memristors. Consequently, the output switches from
RON to ROFF if at least one of the inputs is logic 1 (RON), i.e., if the voltage across
the output memristor is high enough. The value of the MAGIC execution voltage V0 has
to be within a certain interval to ensure that the MAGIC gate works as expected. V0
should be high enough to switch the output memristor even when only one of the
inputs is logic 1, which sets the lower bound on V0. Furthermore, V0 should be
sufficiently low to leave the output unswitched when all the inputs are logic 0 and to
prevent switching of the input memristors, which sets the upper bound on V0. Hence,
the constraints on the execution voltage V0 of an n-input MAGIC NOR gate are
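$$\frac{v_{\mathrm{off}}}{R_{\mathrm{ON}}} \cdot \left( R_{\mathrm{ON}} + \frac{R_{\mathrm{OFF}}}{n-1} \parallel R_{\mathrm{ON}} \right) < V_0, \tag{8.1}$$

$$V_0 < \min\left[ v_{\mathrm{off}} \cdot \left( 1 + \frac{R_{\mathrm{OFF}}}{n R_{\mathrm{ON}}} \right),\; |v_{\mathrm{on}}| \cdot \left( 1 + \frac{n R_{\mathrm{ON}}}{R_{\mathrm{OFF}}} \right) \right], \tag{8.2}$$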

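which ensures that the gate executes a NOR operation and that the input data is never
destroyed. Note that these constraints neglect the parasitic effects of wires. In a
more realistic scenario, where a unit interconnect resistance of $r_w$ between two
adjacent wordlines/bitlines is considered, (8.1) and (8.2) become

$$\frac{v_{\mathrm{off}}}{R'_{\mathrm{ON}}} \cdot \left( R'_{\mathrm{ON}} + \frac{R'_{\mathrm{OFF}}}{n-1} \parallel R'_{\mathrm{ON}} \right) < V_0, \tag{8.3}$$

$$V_0 < \min\left[ v_{\mathrm{off}} \cdot \frac{\frac{R'_{\mathrm{OFF}}}{n} + R'_{\mathrm{ON}}}{R'_{\mathrm{ON}}},\; |v_{\mathrm{on}}| \cdot \frac{R'_{\mathrm{OFF}} + n R'_{\mathrm{ON}}}{R'_{\mathrm{OFF}}} \right], \tag{8.4}$$

where $R'_{\mathrm{ON}}$ and $R'_{\mathrm{OFF}}$ denote the effective resistances and are
equal, respectively, to $(R_{\mathrm{ON}} + i r_w)$ and $(R_{\mathrm{OFF}} + i r_w)$. Note
that these expressions have the same form as (8.1) and (8.2).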
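
The window defined by (8.1) and (8.2) can be evaluated numerically with a short sketch. The device parameters in the example are assumed values chosen only for illustration, and `v_off` is treated as a positive magnitude, which is an assumption about the sign convention. Passing the effective resistances of (8.3) and (8.4) instead of RON and ROFF gives the wire-aware window.

```python
def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def magic_nor_v0_window(n, r_on, r_off, v_off, v_on):
    """Return (lower, upper) bounds on V0 for an n-input MAGIC NOR gate.

    lower follows (8.1), upper follows (8.2). v_off is taken as a positive
    magnitude (assumed sign convention); only |v_on| enters the bound.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Inputs in the case (8.1) covers: one R_ON in parallel with n-1 R_OFF.
    r_in = r_on if n == 1 else parallel(r_off / (n - 1), r_on)
    lower = v_off / r_on * (r_on + r_in)
    upper = min(
        v_off * (1 + r_off / (n * r_on)),
        abs(v_on) * (1 + n * r_on / r_off),
    )
    return lower, upper


if __name__ == "__main__":
    # Assumed example parameters, not taken from the chapter.
    lo, hi = magic_nor_v0_window(n=2, r_on=1e3, r_off=1e6, v_off=0.3, v_on=-1.5)
    print(f"valid V0 window: {lo:.4f} V < V0 < {hi:.4f} V")
    print("window exists" if lo < hi else "no valid V0 for these parameters")
```

If the returned lower bound exceeds the upper bound, no V0 satisfies both constraints for the given device parameters and fan-in.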
Fig. 8.6 a MAGIC NOR execution over a memristive memory column. b Attempt to execute two
distinct MAGIC NOR operations over the same row simultaneously, and c its equivalent circuit
schematic, demonstrating the wrong operation

It is possible to further extend the execution of a MAGIC NOR operation from a
memory row to a memory column in the transpose memory [39]. Figure 8.6a shows
the schematic of a MAGIC NOR gate on a memory column. In this case, the MAGIC
execution voltage (V0) is applied to the output memristor, and the parallel combination
of the input memristors is grounded from the side that is not connected to
the output memristor. This is the only difference between the two cases; the range of V0
is the same as in the previous case of NOR logic execution, which is nondestructive
in terms of its inputs. The steps involved in MAGIC execution over both rows and
columns are summarized in Table 8.1.

Table 8.1 Steps involved in MAGIC NOR execution across a row (column) of a memristive memory

Step #   Operation                                        Application of voltages
1        Initialize output memristor at RON               out ← VWRITE
2        Apply V0 to the input (output) memristor(s),     in1, in2, ... ← V0 (GND) and
         and ground to the output (input) memristor(s)    out ← GND (V0)
         for execution over a memory row (column)
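
The two steps of Table 8.1 can be mimicked by a static behavioral model. This is a minimal sketch under simplifying assumptions: an ideal threshold device, no wire resistance, and a single voltage-divider evaluation instead of the switching dynamics. The constants are assumed example values, not parameters from the chapter.

```python
R_ON, R_OFF = 1e3, 1e6   # assumed resistances (ohms); logic 1 = R_ON, logic 0 = R_OFF
V_OFF = 0.3              # assumed RESET threshold magnitude (volts)
V0 = 1.0                 # assumed MAGIC execution voltage (volts)


def magic_nor(inputs, v0=V0):
    """Static model of an n-input MAGIC NOR gate executed over a memory row."""
    # Step 1: initialize the output memristor to R_ON (logic 1).
    r_out = R_ON
    # Step 2: apply V0 to the parallel input memristors, ground the output.
    conductance = sum(1.0 / (R_ON if bit else R_OFF) for bit in inputs)
    r_in = 1.0 / conductance
    v_out = v0 * r_out / (r_in + r_out)   # voltage divider across the output
    if v_out > V_OFF:                      # threshold exceeded: output is RESET
        r_out = R_OFF
    return 1 if r_out == R_ON else 0


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", magic_nor((a, b)))
```

With these assumed values the model reproduces the NOR truth table: the output stays at logic 1 only when both inputs are logic 0.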
The parallelism of MAGIC within crossbar arrays is limited; two independent
MAGIC NOR gates cannot be executed simultaneously in the same row, as illustrated
in Fig. 8.6b. If V0 is applied to two different sets of input memristors
($\{IN_1^1, IN_2^1\}$ and $\{IN_1^2, IN_2^2\}$), and the output memristors
($\{OUT^1, OUT^2\}$) are grounded, the equivalent circuit is as shown in Fig. 8.6c.
Due to the connection pattern between the input and the output memristors, the two
output memristors are actually connected in parallel, leaving the equivalent resistance
at the output as RON/2, rather than RON, resulting in a wrong operation.

8.3.3 Vector Operation Using MAGIC

[Figure 8.7: panels (a) and (b); recoverable labels: V0, GND, VISO, Parallel MAGIC NOR
Execution, Isolated Row]

Fig. 8.7 a Intrinsic parallel MAGIC NOR execution over data present in all the rows, and b
isolation of a row using an isolation voltage applied to that row (i.e., VISO) to prevent
execution of MAGIC NOR

When the MAGIC execution voltages are applied to wordlines or bitlines (for transpose
MAGIC operation), the influence of these voltages spreads along the whole data line
and is not limited to a particular memory row/column. As shown in
Fig. 8.7, if V0 is applied to the first two columns, and the third column is grounded,
every memristor in the first column performs the MAGIC NOR operation
with its neighboring cell in the second column and produces the output in the
corresponding cell in the third column. This situation can be exploited to perform vector
operations [39]. Note that the latency of this vector operation is independent
of the size of the vector, as long as the entire vector fits inside an array and the
voltage drivers can provide the required currents for proper behavior.
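
The column-wise vector behavior described above can be sketched at the logical level. The crossbar is modeled as a matrix of bits (an assumption for illustration; the electrical behavior is not simulated), and the Python loop stands in for what the hardware performs on all rows in a single step.

```python
def vector_magic_nor(array, in_cols, out_col):
    """Apply one MAGIC NOR step to every row of a bit matrix, in place.

    array   : list of rows, each a list of bits (1 = R_ON, 0 = R_OFF)
    in_cols : column indices driven with V0 (gate inputs)
    out_col : column index that is grounded (gate output)
    """
    for row in array:
        row[out_col] = 1                      # initialize the output cell to R_ON
        if any(row[c] for c in in_cols):      # any logic-1 input resets the output
            row[out_col] = 0
    return array


if __name__ == "__main__":
    memory = [
        [0, 0, 0],
        [0, 1, 0],
        [1, 0, 0],
        [1, 1, 0],
    ]
    vector_magic_nor(memory, in_cols=(0, 1), out_col=2)
    for row in memory:
        print(row)
```

Each row computes the NOR of its first two cells into the third cell, so a four-element vector NOR is obtained with one gate evaluation per row and no data movement out of the array.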
If the vector operation is restricted to a few rows in the array, it is possible to
isolate a particular row from the MAGIC execution. This is achieved using isolation
voltages, which are similar to half-select voltages for write operations. While in
write operations half of the write voltage (i.e., Vwrite/2) is applied to prevent
unwanted switching, applying V0/2 in a MAGIC NOR operation would disturb the input
memristors. Hence, we propose ranges of voltages that can be applied to isolate
rows/columns, thus preventing them from executing a MAGIC NOR operation, as
shown in Fig. 8.7b. When a MAGIC operation is performed over the rows, VISO must
fulfill

$$0 < |V_{\mathrm{ISO}}| < |v_{\mathrm{off}}| < \frac{V_0}{2}, \tag{8.5}$$

and when a MAGIC operation is performed over columns, VISO should satisfy

$$V_0 - |v_{\mathrm{off}}| < |V_{\mathrm{ISO}}| < |v_{\mathrm{on}}|, \tag{8.6}$$

where von and voff are the SET and RESET switching thresholds of the memristor and
V0 is the MAGIC execution voltage. The voltage levels that should be supported by the
peripheral circuit in order to perform conventional memory operations and execute
MAGIC logic within the memristive memory are listed in Table 8.2. Figure 8.8 shows
the design of the peripheral circuit needed to support these operations and the voltage
levels inside the memristive memory. Analog multiplexers, as shown in Fig. 8.8b, can
be designed to assert different voltage levels to support write and MAGIC operations,
and a sense amplifier can be used to perform read operations.

Table 8.2 Voltage levels supported by the peripheral circuit to perform conventional memory
operations and execute MAGIC NOR gates within the memory

Operation          Voltages applied
Write              Vwrite = VSET and VRESET for writing logic 1 and 0
Read               Vread
Ground             GND
Half-select        Vwrite/2
MAGIC execution    V0
MAGIC isolation    VISO

[Figure 8.8: panels (a) and (b); recoverable labels: On-chip Memory Controller, Decoder,
Memristive Memory, WL = Wordlines, BL = Bitlines, SA = Sense Amplifiers, log2 k address
lines, Memory Operation, Logic Operation, To WL/BL; mux inputs V1 = VSET, V2 = VRESET,
V3 = VREAD, V4 = GND, V5 = VSET/2, V6 = VRESET/2, V7 = Vctrl1, V8 = Vctrl2]

Fig. 8.8 a Peripheral circuit around memory. b Structure of an analog mux
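
The isolation conditions (8.5) and (8.6) can be turned into a small range check. The sketch below returns the admissible interval for |VISO|, or `None` when the conditions cannot be met. The function name, the `direction` argument, and the example parameters are illustrative assumptions.

```python
def isolation_window(v0, v_on, v_off, direction):
    """Admissible range (low, high) for |V_ISO|, or None if no range exists.

    direction = "rows"    applies (8.5): 0 < |V_ISO| < |v_off| < V0 / 2
    direction = "columns" applies (8.6): V0 - |v_off| < |V_ISO| < |v_on|
    """
    if direction == "rows":
        if not abs(v_off) < v0 / 2:
            return None                    # (8.5) also requires |v_off| < V0 / 2
        return (0.0, abs(v_off))
    if direction == "columns":
        low, high = v0 - abs(v_off), abs(v_on)
        return (low, high) if low < high else None
    raise ValueError("direction must be 'rows' or 'columns'")


if __name__ == "__main__":
    # Assumed example parameters, not taken from the chapter.
    params = dict(v0=1.0, v_on=-1.5, v_off=0.3)
    print("rows:   ", isolation_window(direction="rows", **params))
    print("columns:", isolation_window(direction="columns", **params))
```

Both bounds are strict inequalities in (8.5) and (8.6), so a practical VISO would be chosen inside the returned interval with some margin.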

8.3.3.1 Limitations on the Performance of Vector Operations Using MAGIC

While MAGIC NOR operations can be performed in every row (column) in parallel,
the length of the SIMD that can be implemented within a memristive crossbar is
restricted by the size of the array. The size of the array is further dependent on various
circuit and technological parameters. The circuit parameters crucial for deciding
the size of the array are the MAGIC execution voltage V0 , and the technological
parameters include memristive properties (RON , ROFF , von , and voff ) and parasitic
effects of the CMOS process (i.e., interconnect resistance and capacitance). To be
able to support MAGIC NOR operations in all the rows (columns) of the crossbar,
the MAGIC execution must be supported in the worst-case configuration at the row
(column) farthest from the voltage drivers, since the voltage across it would be the
lowest. Worst-case configuration occurs when all the resistance values in the array
8 mMPU—A Real Processing-in-Memory Architecture … 203

are RON and it is required to execute MAGIC over all the rows (columns). This is
because lower memristor resistance would require higher current to be drawn from
the drivers, and as a consequence, the IR drop across the parasitic resistances would
be high, lowering the voltage drop across the farthest memristor. Hence, given fixed
V0 and other technological parameters, a finite number of MAGIC NOR operations
will be supported, which will limit the size of the memristive crossbar.
Furthermore, to support the execution of multiple MAGIC NOR operations in
parallel, the voltage drivers must supply a large current to the array, which has
two consequences. First, to supply a current large enough to support several MAGIC
NOR operations, the drivers must also be large, which increases the area of the
chip. Second, since V0 is higher than the write voltage, performing
many MAGIC NOR operations in parallel increases the energy consumption.
Hence, while the goal is parallel execution of MAGIC NOR gates, this parallelism
is limited by the area and power budget of the chip from the point of view of
the peripheral circuit.
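
The IR-drop argument above can be made concrete with a deliberately crude sketch. It assumes a one-dimensional resistive ladder: a driver at voltage V0, a wire segment of resistance r_wire between consecutive rows, and one load of resistance r_load per row. This ladder model, the function names, and all parameter values are assumptions made here for illustration; a real crossbar is two-dimensional, and a MAGIC gate places input and output memristors in series.

```python
def farthest_row_voltage(v0, r_wire, r_load, n_rows):
    """Voltage reaching the last rung of a resistive ladder driven by v0."""
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    # Equivalent resistance seen at each node, looking away from the driver.
    z = [0.0] * n_rows
    z[-1] = r_load
    for k in range(n_rows - 2, -1, -1):
        downstream = r_wire + z[k + 1]
        z[k] = r_load * downstream / (r_load + downstream)
    # Walk outwards from the driver, applying one voltage divider per segment.
    v = v0
    for k in range(n_rows):
        v *= z[k] / (r_wire + z[k])
    return v


def max_rows(v0, r_wire, r_load, v_min, limit=4096):
    """Largest row count whose farthest row still sees at least v_min."""
    n = 0
    while n < limit and farthest_row_voltage(v0, r_wire, r_load, n + 1) >= v_min:
        n += 1
    return n


if __name__ == "__main__":
    # Illustrative values only: a lower load resistance (the all-RON case)
    # leaves less voltage at the far end than a higher one.
    for r_load in (1e3, 1e5):
        print(r_load, farthest_row_voltage(1.0, 1.0, r_load, 64))
```

In this simplified model, lowering r_load toward RON reduces the voltage at the farthest row, so the number of rows that still receive a sufficient voltage shrinks, which mirrors the worst-case reasoning in the text.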

8.4 mMPU Microarchitecture

The primary difference between a memristive memory and an mMPU is the control
mechanism. In addition to supporting regular memory operations (i.e., read
and write), the mMPU controller also handles logic operations within the memory,
and in practice its implementation determines the performance of the mMPU. We
now present the modifications that must be made to the on-chip controller of the
mMPU [18]. We then present SIMPLE MAGIC [20], an automatic synthesis tool
we have developed that receives an arbitrary Boolean function as input and proposes
an optimal (in terms of latency, energy, or area) sequence of MAGIC NOR gates to
implement that function using the mMPU.

8.4.1 mMPU Controller

The mMPU controller is responsible for generating the control signals for the memory
to perform read, write, and logical operations within the mMPU. As shown in Fig. 8.9,
the CPU sends the instruction to the mMPU controller. This instruction is received
by a CPU-in block, where it is decoded. The instruction is then broadcast to the
arithmetic, read, and write blocks, and the block suitable for the instruction type is
selected using the memory out mux. For example, if the CPU sends an arithmetic
instruction, the control sequence from the arithmetic block is selected and
sent to the memristive memory.
Whereas reads and writes in the mMPU are performed in a conventional way
[18], across the memristors at the target wordlines and bitlines, executing logical
instructions is more complicated since they require a sequence of logical steps. The
arithmetic block is a sophisticated finite state machine, the role of which is to
efficiently break the instruction down into a series of MAGIC operations, and to select
the memristive cells that perform the operations within the memory array. For example,
suppose the CPU sends an instruction to add two numbers (i.e., ADD) within the memory.
The instruction is received by the CPU-in block, which identifies the instruction as
ADD and generates the memory out mux select signal. Then, the instruction is sent
to the arithmetic block, where an appropriate, pre-synthesized execution sequence
is selected for this instruction. This execution sequence is then executed on the
memristive memory. The mMPU controller pipelines this operational sequence to
the memory, changing the applied voltages on each memory clock cycle. Efficient
pipelining maximizes the processing efficiency in terms of speed and energy. To
optimize the throughput of the arithmetic instruction execution, different considerations
should be taken into account [19], as detailed below.

[Figure 8.9: block diagram with the labels CPU Instr., CPU Data, CPU In, Opcode,
Arithmetic Block, Write Block, Read Block, Memory out MUX, Memristive Memory,
CPU Out, and Data Out; WL = wordlines, BL = bitlines.]

Fig. 8.9 Detailed block diagram of the mMPU controller, where an arithmetic block is added to
support computation within the memristive memory [18]
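
The dispatch path described above can be sketched in a few lines. The sketch is a toy software model, not the controller of [18]: the opcode names, the cell labels, and the table of pre-synthesized sequences are hypothetical, and AND and OR stand in for the ADD example to keep the sequences short. It also omits the initialization of the output cell that MAGIC requires before each gate.

```python
# Hypothetical pre-synthesized sequences. Each step is (output_cell, input_cells)
# and stands for one MAGIC NOR (two inputs) or NOT (one input) per clock cycle.
SEQUENCES = {
    "OR": [("t0", ("a", "b")), ("out", ("t0",))],
    "AND": [("t0", ("a",)), ("t1", ("b",)), ("out", ("t0", "t1"))],
}


def magic_nor(cells, out, inputs):
    """Store the NOR of the input cells (a NOT when there is a single input)."""
    cells[out] = 0 if any(cells[i] for i in inputs) else 1


def controller(cells, opcode, operand=None):
    """Decode the opcode and route it to the read, write, or arithmetic path."""
    if opcode == "READ":
        return cells[operand]
    if opcode == "WRITE":
        address, value = operand
        cells[address] = value
        return None
    # Arithmetic path: issue one MAGIC step per memory clock cycle.
    cycles = 0
    for out, inputs in SEQUENCES[opcode]:
        magic_nor(cells, out, inputs)
        cycles += 1
    return cycles


if __name__ == "__main__":
    memory = {}
    controller(memory, "WRITE", ("a", 1))
    controller(memory, "WRITE", ("b", 0))
    cycles = controller(memory, "AND")
    print(controller(memory, "READ", "out"), cycles)
```

Here the opcode check plays the role of the memory out mux, and the loop over a stored sequence plays the role of the finite state machine in the arithmetic block.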

8.4.1.1 Algorithms for Processing-in-Memory

To enable efficient data processing using the mMPU, novel algorithms (e.g., algorithms
based solely on MAGIC NOR operations) need to be developed. Exploiting
the parallelism offered by the mMPU as described in Sect. 8.3 is essential to
optimize these algorithms in terms of energy, performance, and area. For example,
multiplying K-binary matrices, each of which is of size M × N, requires
5NK − 5K + 2M + 1 steps when the algorithm is optimized for MAGIC NOR
execution within the mMPU [18]. This algorithm has a quadratic time complexity of
O(NK), whereas a standard von Neumann architecture requires a cubic time complexity
of O(NKM). This instance exemplifies the potential performance benefits
of processing data within the memory. Hence, designing a suitable algorithm is
key to efficient processing using the mMPU.
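
To get a feel for the two growth rates quoted above, the step-count expression can be evaluated directly. In the sketch below, the MAGIC count is the expression from the text; the sequential count simply uses N·K·M with a constant factor of one, which is an assumption made here, so the two columns compare orders of growth rather than equivalent units of work.

```python
def magic_steps(n, k, m):
    """Step count quoted in the text for MAGIC NOR execution: 5NK - 5K + 2M + 1."""
    return 5 * n * k - 5 * k + 2 * m + 1


def sequential_steps(n, k, m):
    """Proxy for the O(NKM) sequential case, with an assumed constant factor of 1."""
    return n * k * m


if __name__ == "__main__":
    for size in (4, 16, 64):
        print(size, magic_steps(size, size, size), sequential_steps(size, size, size))
```

For very small operands the constant factors dominate (69 steps against 64 when N = K = M = 4), whereas the gap widens quickly in favor of the in-memory count as the dimensions grow.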

[Figure 8.10: panels (a) Static and (b) Dynamic, each showing the memory divided into
Processing (P) and Storage (S) regions, with a table of P/S assignments at times 0, t1,
t2, and t3.]

Fig. 8.10 a Static processing area, where a portion of the memory space is dedicated for processing
(in blue), b dynamic processing area, where a portion of memory space, variable in location and size,
is allocated for processing or storage (in blue, purple, and orange), and allocation of processing (P)
and storage (S) areas with respect to time. The tables next to the figures denote the time multiplexing
of processing and storage space for both the schemes. Symbols S and P mean storage and processing,
respectively

8.4.1.2 Processing Area

Logic execution within the mMPU requires utilization of memory cells for computa-
tion. This utilization must maintain the integrity of the data stored in the memristive
memory. For example, while calculating complex Boolean functions, several MAGIC
NOR/NOT operations must be performed, and the intermediate values of these oper-
ations are also stored within the memristors, which we call functional memristors
[19, 39]. The functional memristors must be separated from the memristors where
valid data is stored, and the Operating System (OS) has to make sure that no data is
destroyed. One straightforward solution to this problem is to allocate a fixed amount
of memory space for processing; this is known as the static processing area [18]
as shown in Fig. 8.10a. A more complicated solution is to dynamically allocate the
processing area based on the availability of the memory cells and required amount
of functional memory space for processing; this is known as the dynamic processing
area, as shown in Fig. 8.10b.
Figure 8.10 shows the difference between static and dynamic processing areas.
It also shows how the dynamic technique time multiplexes the different portions of
the available memory for processing and storage, while the static technique uses
dedicated areas for processing and storage. While the dynamic processing area
scheme allocates the memory space efficiently without any wastage, it requires
costly memory management. In contrast, the static processing area scheme does not
require any memory management since the area is committed at design time, but it
suffers from lower memory utilization.
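
The utilization trade-off between the two schemes can be illustrated with a toy model. The sketch assumes that memory is counted in rows and that the processing demand at each time step is known; the function names and the numbers are hypothetical, and the bookkeeping cost of the dynamic scheme is not modeled.

```python
def static_storage(total_rows, reserved_rows, demand_per_step):
    """Rows usable for storage at each step when the processing area is fixed."""
    if max(demand_per_step) > reserved_rows:
        raise ValueError("reserved processing area is too small")
    return [total_rows - reserved_rows for _ in demand_per_step]


def dynamic_storage(total_rows, demand_per_step):
    """Rows usable for storage at each step when processing rows are allocated on demand."""
    if max(demand_per_step) > total_rows:
        raise ValueError("processing demand exceeds the memory size")
    return [total_rows - demand for demand in demand_per_step]


if __name__ == "__main__":
    demand = [1, 3, 0]  # functional rows needed at three consecutive time steps
    print(static_storage(8, 3, demand))
    print(dynamic_storage(8, demand))
```

The static scheme must reserve enough rows for the peak demand at all times, whereas the dynamic scheme returns unused rows to storage whenever the demand drops.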

Fig. 8.11 The desired logic function is synthesized using ABC [32] for NOR and NOT gates
and then optimized specifically for MAGIC within memory, generating a general mapping and a
sequence of operations. The general execution is mapped to specific cells in real time, based on the
temporary state of the mMPU and its available cells [20]

8.4.2 Automatic Logic Synthesis Using SIMPLE MAGIC

The state machine of the mMPU controller is designed to execute the sequence of
required NOR and NOT operations within the mMPU. Wisely exploiting the parallelism
capabilities described in Sect. 8.3 to execute numerous NOR operations simultaneously
on different rows or columns may significantly improve the computation performance.
To maximize the efficiency of the computations performed by the mMPU,
the controller has to be designed to perform a NOR and NOT sequence
that is optimized in terms of latency, energy, area, or a combination of the
three. The optimized sequence is determined automatically using SIMPLE MAGIC
[20], a tool we recently developed.
SIMPLE receives any logic function and performs the following flow, as illustrated
in Fig. 8.11:
1. The function is converted into a netlist of NOR and NOT gates using a modified
ABC synthesis tool [32].
2. The netlist is mapped into a memristive memory by solving an optimization
problem using the Z3 SMT solver [8]. For every gate j, the variables of
the problem are
• The wordline and bitline coordinates of the inputs Aj, Bj and the output Ej of the
gate:

\[ \{R_{A_j}, C_{A_j}\},\ \{R_{B_j}, C_{B_j}\},\ \{R_{E_j}, C_{E_j}\}. \]

• The number Tj of the clock cycle in which the gate is executed.

The mapping is done while taking into account the following constraints of
in-memory processing:
• Inputs and outputs of each MAGIC gate have to be mapped to a legal memory
cell (when the size of the memory is Row_num × Col_num):

\[ \forall x_j \in \{A_j, B_j, E_j\}: (0 < C_{x_j} \le \mathrm{Col}_{num}) \cap (0 < R_{x_j} \le \mathrm{Row}_{num}). \tag{8.7} \]

• The execution time of each gate is positive:

\[ \forall \text{gate } j: T_j > 0. \tag{8.8} \]

• Outputs of different gates have to be mapped to different memory cells:

\[ \forall E_k, E_j,\ k \neq j: (C_{E_j} \neq C_{E_k}) \cup (R_{E_j} \neq R_{E_k}). \tag{8.9} \]

• Inputs and outputs of each MAGIC NOR gate have to be mapped to the same
column or the same row (as described in Sect. 8.3.2):

\[
\begin{aligned}
\forall \text{gate } j:\ & [(C_{A_j} = C_{B_j} = C_{E_j}) \cap (R_{A_j} \neq R_{B_j} \neq R_{E_j})]\ \cup \\
& [(C_{A_j} \neq C_{B_j} \neq C_{E_j}) \cap (R_{A_j} = R_{B_j} = R_{E_j})].
\end{aligned}
\tag{8.10}
\]

• To perform several MAGIC gates in parallel, the inputs and outputs have to
be aligned (as shown in Fig. 8.7):

\[
\begin{aligned}
\forall \text{gate } j, k:\ & T_j \neq T_k\ \cup \\
& \{\{[(C_{A_j} = C_{A_k} \cap C_{B_j} = C_{B_k}) \cup (C_{A_j} = C_{B_k} \cap C_{B_j} = C_{A_k})] \cap (C_{E_j} = C_{E_k})\}\ \cap \\
& \quad (R_{A_j} = R_{B_j} = R_{E_j} \cap R_{A_k} = R_{B_k} = R_{E_k})\}\ \cup \\
& \{\{[(R_{A_j} = R_{A_k} \cap R_{B_j} = R_{B_k}) \cup (R_{A_j} = R_{B_k} \cap R_{B_j} = R_{A_k})] \cap (R_{E_j} = R_{E_k})\}\ \cap \\
& \quad (C_{A_j} = C_{B_j} = C_{E_j} \cap C_{A_k} = C_{B_k} = C_{E_k})\}.
\end{aligned}
\tag{8.11}
\]

• A MAGIC gate can be executed only after its inputs have been produced,
and each input has to be located in the same memory cell as the output of the
gate connected to it:

\[ \forall E_h,\ x_j \in \{A_j, B_j\} \text{ that are connected}: [(C_{E_h} = C_{x_j}) \cap (R_{E_h} = R_{x_j})] \cap (T_h < T_j). \tag{8.12} \]

The optimization problem can be solved to minimize the latency, area, energy,
or a combination of them. For example, the optimization function for minimizing
latency is

\[ \mathrm{Latency}_{\mathrm{best\_mapping}} = \min\{\max_j T_j\}, \quad \text{where } 0 < j \le \#\mathrm{gates}. \tag{8.13} \]

3. The mapping is reshuffled in real time, according to the occupancy of the memory
at the moment the computation is done.
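
The constraints of step 2 can be read as a checklist that any candidate mapping must pass. The sketch below checks a hand-written mapping against (8.7) to (8.12) and reports the latency term of (8.13); it does not search for a mapping and does not use an SMT solver. The dictionary layout, the function names, and the three-gate netlist are assumptions made here for illustration, and only two-input NOR gates are handled.

```python
from itertools import combinations


def in_line(gate, axis):
    """True if the gate's three cells share coordinate `axis` (0 = row, 1 = column)."""
    return len({gate[x][axis] for x in "ABE"}) == 1


def aligned(j, k, axis):
    """Parallel-execution test of (8.11) along one axis."""
    a_j, b_j, a_k, b_k = j["A"][axis], j["B"][axis], k["A"][axis], k["B"][axis]
    inputs_match = (a_j, b_j) in ((a_k, b_k), (b_k, a_k))
    outputs_match = j["E"][axis] == k["E"][axis]
    other = 1 - axis
    return inputs_match and outputs_match and in_line(j, other) and in_line(k, other)


def check_mapping(gates, rows, cols):
    """Return True if the mapping satisfies constraints (8.7) to (8.12)."""
    for g in gates.values():
        cells = [g[x] for x in "ABE"]  # (row, column) pairs, 1-based
        if not all(0 < r <= rows and 0 < c <= cols for r, c in cells):  # (8.7)
            return False
        if g["T"] <= 0:  # (8.8)
            return False
        # (8.10): three distinct cells lying in a single row or a single column.
        if len(set(cells)) != 3 or not (in_line(g, 0) or in_line(g, 1)):
            return False
        for pin, source in g["fanin"].items():  # (8.12)
            if source is not None:
                h = gates[source]
                if h["E"] != g[pin] or not h["T"] < g["T"]:
                    return False
    for j, k in combinations(gates.values(), 2):
        if j["E"] == k["E"]:  # (8.9)
            return False
        if j["T"] == k["T"] and not (aligned(j, k, 0) or aligned(j, k, 1)):  # (8.11)
            return False
    return True


def latency(gates):
    """Inner term of (8.13): the last clock cycle used by the mapping."""
    return max(g["T"] for g in gates.values())


# Hypothetical netlist: g1 and g2 run in parallel in rows 1 and 2, g3 combines them.
gates = {
    "g1": {"A": (1, 1), "B": (1, 2), "E": (1, 3), "T": 1, "fanin": {"A": None, "B": None}},
    "g2": {"A": (2, 1), "B": (2, 2), "E": (2, 3), "T": 1, "fanin": {"A": None, "B": None}},
    "g3": {"A": (1, 3), "B": (2, 3), "E": (3, 3), "T": 2, "fanin": {"A": "g1", "B": "g2"}},
}

if __name__ == "__main__":
    print(check_mapping(gates, rows=3, cols=3), latency(gates))
```

In this example the two first-level gates share their input and output columns, so they may execute in the same clock cycle, and the third gate runs along column 3, where both of its inputs were produced.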
Automation of the process promises optimal results and reduces the time required
to design the mMPU controller. The first two steps are performed to design the
state machine of the arithmetic block of the mMPU controller, and the third step is
performed by the mMPU controller during run time. Figure 8.12 shows that SIMPLE
achieves an average speedup of 1.9× over a NOT and NOR
netlist prior to optimization with SIMPLE (and before synthesizing the netlist with
ABC). Additionally, SIMPLE yields a speedup of 1.94× compared to
previous work [5]. Two major factors contribute to the performance benefit of SIMPLE.
First, SIMPLE exploits the intrinsic parallelism offered by MAGIC NOR
execution within the memristive memory. Second, while exploiting this parallelism,
SIMPLE rearranges the netlist in such a way that no copy operations of data
within the array are required between successive steps of execution. Current
and future improvements of SIMPLE may further increase performance.

[Figure 8.12: bar chart of the number of computation steps (axis from 0 to 300) for the
benchmarks 5xp1, clip, cm150a, cm162a, cm163a, misex1, parity, and x2, with four series:
Chakraborti et al., the original netlist, ABC, and SIMPLE.]

Fig. 8.12 Performance comparison of SIMPLE [20] (dark green) with other synthesis approaches,
which include Chakraborti et al. [5] (green), the original netlist without synthesis (blue), and the
netlist synthesized with ABC (yellow)

8.5 System Design Using mMPU

Introducing an mMPU to a computing machine requires that new aspects of system
design be considered. First, the appropriate computation model for exploiting mMPU
capabilities must be chosen. Using the mMPU as a stand-alone accelerator, as shown
in Fig. 8.13a, allows us to exploit the existing knowledge about accelerator operation.
In this usage model, the mMPU address space is separated from that of the main
memory. Any data that is to be processed within the mMPU needs to be transferred
(via direct R/W operations or DMA transactions) from its original location in the main
memory to a dedicated processing location within the mMPU. Once the processing
is completed, the result needs to be copied back to a location reserved for it in the
main memory for later use.
Another optional computation model is to incorporate the mMPU address space
as a part of the (or as the entire) main memory address space, as shown in Fig. 8.13b.

[Figure 8.13: panels (a) and (b), each showing a CPU and accelerators (GPU, DSP, crypto,
DMA) on a system bus above a memory sub-system. Labels in (a): mMPU, DRAM, Compute CMD,
R/W Data. Labels in (b): mMPU, DRAM (optional), Compute CMD, Argument and Results.]

Fig. 8.13 Illustration of the possible mMPU usage models. When using the mMPU as (a) an
accelerator, data to be processed is copied from the main memory to the mMPU and computing
commands are sent from the CPU. When using the mMPU as (b) a part of the main memory, the data
meant for processing is stored beforehand in the mMPU address space, allowing the commencement
of processing with a single command from the CPU

Combined with careful data allocation, this usage model may avoid most of the data
transfers and further speed up computation. This enhancement, however, comes at
the cost of more complicated control (discussed later in this section), and with the
need to reserve parts of the available memory space (otherwise used to store data)
for intermediate results of the computation.
Data coherency must also be addressed. Using the mMPU allows data to be
modified in place within the main memory without modifying any copies
of the same data elsewhere in the memory hierarchy (i.e., in caches). Therefore, maintaining
data coherency requires an added capability to invalidate data in caches if the data
was changed by the mMPU. When the mMPU is used as an accelerator, data that is
being processed needs to be locked against changes (by using an atomic operation or some
other means) so that it is not changed while the mMPU is processing it. The concepts
of data redundancy and memory reliability also need to be addressed in order for
a system containing the mMPU to be seamlessly compatible with existing software and
data correction mechanisms.
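
The coherency issue can be illustrated with a toy model of a cache in front of the memory. Everything in the sketch is hypothetical: the Cache class, the 8-bit word width, and the mmpu_nor_in_place helper are illustrative stand-ins, not an interface defined in this chapter.

```python
class Cache:
    """Toy CPU cache that maps an address to a cached copy of the memory word."""

    def __init__(self):
        self.lines = {}

    def read(self, memory, address):
        if address not in self.lines:  # miss: fetch the word from main memory
            self.lines[address] = memory[address]
        return self.lines[address]

    def invalidate(self, address):
        self.lines.pop(address, None)


def mmpu_nor_in_place(memory, cache, src_a, src_b, dst):
    """Hypothetical in-memory bitwise NOR on 8-bit words, followed by invalidation."""
    memory[dst] = ~(memory[src_a] | memory[src_b]) & 0xFF
    cache.invalidate(dst)  # without this, the CPU would keep reading the stale copy


if __name__ == "__main__":
    memory = [0x0F, 0x30, 0xAA]
    cache = Cache()
    print(hex(cache.read(memory, 2)))  # the CPU caches the old value
    mmpu_nor_in_place(memory, cache, 0, 1, 2)
    print(hex(cache.read(memory, 2)))  # the next read misses and sees the new value
```

If the invalidation call is removed, the second read returns the cached value from before the in-memory operation, which is exactly the stale-data hazard described above.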
A programming model must be suited for each usage model for efficient utilization
of the mMPU. Because the rest of the system should be as oblivious to the mMPU as
possible, standard interfaces should be adopted, and the mMPU should be designed
so that minimal changes to the rest of the system are required. Furthermore, apart
from using the mMPU for data processing, it can also be selectively used as the system
memory, making it compatible with the von Neumann computing model. Rather than
being burdened with challenging optimization tasks as in the case of conventional
architectures, for the general use case, the programmer only has to determine the
desired operation, the addresses of the inputs and outputs, and the size of the inputs.
Such an accelerator is accessed through software support, i.e., additional libraries with
specific functions that the mMPU will support, in the way that CUDA [44] is used with
NVIDIA GPUs. In this case, the CPU will offload the code to the mMPU directly without the
need to modify the ISA or the current conventional systems.

Fig. 8.14 Examples of the structure of an mMPU instruction. In a conventional memory access
instruction (a), the instruction is composed of a direction (read/write) bit, an address field, and a
data field. An instruction for in-memory computing (b) is always in the write direction and is
written to an address that is reserved by the controller for computing instructions. The rest of the
bits are used to transmit any information needed for the execution of the command and may specify
the operation to be carried out, the input/output location, size, etc.

Two approaches are proposed for utilizing the mMPU as a memory capable of
computing. The first requires extending the ISA with additional commands that the
mMPU supports. These commands will be successively dispatched by the CPU to the
mMPU so that computation tasks are performed on specified locations in the memory
(i.e., addresses). In the second approach, the mMPU will have a reserved address,
which when written to will initiate the equivalent command. Thus, an instruction
for in-memory computing contains a write operation to a reserved address that is
mapped to a dedicated register within the mMPU controller. The instruction must
contain all the relevant information for execution, such as the required operation,
operands and result location, and size. An example of such an instruction is shown
in Fig. 8.14.
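
The second approach can be sketched as an ordinary bus write whose data field carries the packed command. The reserved address, the opcode values, and the field widths below are all hypothetical choices made for illustration; the chapter does not specify an encoding.

```python
MMPU_CMD_ADDRESS = 0xFFFF_FFF0  # hypothetical reserved address
OPCODES = {"NOR": 0x01, "ADD": 0x02}  # hypothetical opcode values
# Hypothetical field layout of the 64-bit data word, most significant field first.
FIELDS = (("opcode", 8), ("src_a", 16), ("src_b", 16), ("dst", 16), ("size", 8))


def encode(**values):
    """Pack the named fields into a single integer data word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack a data word into its fields, as the controller register would."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def issue(bus_write, op, src_a, src_b, dst, size):
    """Send the command as an ordinary write to the reserved address."""
    word = encode(opcode=OPCODES[op], src_a=src_a, src_b=src_b, dst=dst, size=size)
    bus_write(MMPU_CMD_ADDRESS, word)


if __name__ == "__main__":
    log = []
    issue(lambda address, data: log.append((address, data)), "ADD", 3, 4, 5, 8)
    print(hex(log[0][0]), decode(log[0][1]))
```

Because the command travels as a normal write, the CPU and the bus need no new instruction types; only the controller has to recognize the reserved address and interpret the data word.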

8.6 Conclusions

Data transfer between processing and memory units is the major performance and
energy-efficiency bottleneck of modern computing systems, commonly known as the
von Neumann bottleneck. Whereas prior art has tried to reduce the distance between
processing and memory units to solve this problem, we propose the mMPU, an
entirely different solution that can tackle the von Neumann bottleneck even more
efficiently. In the mMPU, we rely on employing memristive memory cells directly
for processing, which largely eliminates the necessity for data transfer. We also
present MAGIC, a technique to execute logical operations within the memristive
memory crossbar without any modification of the memory structure. We further
show how to extend execution of a single MAGIC gate to a parallel execution of
several MAGIC gates within the memory crossbar. We present our recent works
on the mMPU microarchitecture design, which includes the mMPU controller and
an automatic logic synthesis tool. Finally, we describe the implications of integrating
the mMPU into a system in two different ways, i.e., in an accelerator
mode and in a main memory mode. Applications that will benefit the most from this
new architecture include deep learning, image processing, DNA sequencing, and
matrix multiplication, which have a high degree of intrinsic parallelism and large
amounts of data.

References

1. R. Balasubramonian, B. Grot, Near-data processing. IEEE Micro 36(1), 4–5 (2016). https://doi.org/10.1109/MM.2016.1
2. B. Black, Die Stacking is Happening! Proceedings of the International Symposium on Microarchitecture (2013)
3. M.N. Bojnordi, E. Ipek, Memristive Boltzmann machine: a hardware accelerator for combinatorial optimization and deep learning. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016), pp. 1–13. https://doi.org/10.1109/HPCA.2016.7446049
4. Y. Cassuto, S. Kvatinsky, E. Yaakobi, Sneak-path constraints in memristor crossbar arrays. In: Proceedings of the IEEE International Symposium on Information Theory (ISIT) (2013), pp. 156–160
5. S. Chakraborti, P.V. Chowdhary, K. Datta, I. Sengupta, Bdd based synthesis of boolean functions using memristors. In: 2014 9th International Design and Test Symposium (IDT) (2014), pp. 136–141. https://doi.org/10.1109/IDT.2014.7038601
6. Y.C. Chen et al., An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel threshold switching, self-rectifying chalcogenide device. In: IEEE International Electron Devices Meeting IEDM ’03 Technical Digest (2003), pp. 37.4.1–37.4.4
7. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (2016), pp. 27–39. https://doi.org/10.1109/ISCA.2016.13
8. L. De Moura, N. Bjørner, Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems (2008), pp. 337–340
9. P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, H. Noyes, An efficient and scalable semiconductor architecture for parallel automata processing. IEEE Trans. Parallel Distrib. Syst. 25(12), 3088–3098 (2014). https://doi.org/10.1109/TPDS.2014.8
10. Y. Eckert, N. Jayasena, G.H. Loh, Thermal feasibility of die-stacked processing in memory. In: Proceedings of the 2nd Workshop Near-Data Processing (2014)
11. D.G. Elliott, M. Stumm, W.M. Snelgrove, C. Cojocaru, R. Mckenzie, Computational RAM: implementing processors in memory. IEEE Des. Test Comput. 16(1), 32–41 (1999). https://doi.org/10.1109/54.748803
12. M. Gokhale, B. Holmes, K. Iobst, Processing in memory: the Terasys massively parallel PIM array. Computer 28(4), 23–31 (1995). https://doi.org/10.1109/2.375174
13. L. Guckert, E.E. Swartzlander, MAD gates: Memristor logic design using driver circuitry. IEEE Trans. Circuits Syst. II Exp. Briefs 64(2), 171–175 (2017). https://doi.org/10.1109/TCSII.2016.2551554
14. Q. Guo, X. Guo, Y. Bai, E. Ipek, A resistive TCAM accelerator for data-intensive computing. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM (2011), pp. 339–350
15. Q. Guo, X. Guo, R. Patel, E. Ipek, E.G. Friedman, AC-DIMM: associative computing with STT-MRAM. ACM SIGARCH Comput. Arch. News 41(3), 189–200 (2013)
16. HSA Foundation: Harmonizing the Industry Around Heterogeneous Computing, http://www.hsafoundation.com/
17. J.J. Huang, Y.M. Tseng, W.C. Luo, C.W. Hsu, T.H. Hou, One selector one resistor (1s1r) crossbar array for high-density flexible memory applications. IEEE (2011), pp. 31.7.1–31.7.4
18. R.B. Hur, S. Kvatinsky, Memristive memory processing unit (MPU) controller for in-memory processing. In: 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE) (2016), pp. 1–5. https://doi.org/10.1109/ICSEE.2016.7806045
19. R.B. Hur, N. Talati, S. Kvatinsky, Algorithmic considerations in memristive memory processing units (MPU). In: CNNA 2016 15th International Workshop on Cellular Nanoscale Networks and their Applications (2016), pp. 1–2
20. R.B. Hur, N. Wald, N. Talati, S. Kvatinsky, SIMPLE MAGIC: synthesis and in-memory MaPping of logic execution for memristor-aided loGIC. In: Proceeding of the IEEE International Conference on Circuits Aided Design (2017)
21. Hybrid Memory Cube Consortium, Hybrid Memory Cube Specification 1.0 (2013)
22. JEDEC Solid State Technology Association: High Bandwidth Memory (HBM) DRAM, http://www.jedec.org/standards-documents/results/jesd235
23. S. Kvatinsky, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, Memristor-based material implication (imply) logic: design principles and methodologies. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22(10), 2054–2066 (2014). https://doi.org/10.1109/TVLSI.2013.2282132
24. S. Kvatinsky, N. Wald, G. Satat, A. Kolodny, U.C. Weiser, E.G. Friedman, MRL–memristor ratioed logic. In: 2012 13th International Workshop on Cellular Nanoscale Networks and their Applications (2012), pp. 1–6. https://doi.org/10.1109/CNNA.2012.6331426
25. S. Kvatinsky, E.G. Friedman, A. Kolodny, U.C. Weiser, The desired memristor for circuit designers. IEEE Circuits Syst. Mag. 13(2), 17–22 (2013). https://doi.org/10.1109/MCAS.2013.2256257
26. S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, MAGIC - memristor-aided logic. IEEE Trans. Circuits Syst. II Express Briefs 61(11), 895–899 (2014). https://doi.org/10.1109/TCSII.2014.2357292
27. J. Lee, M. Jo, D. Jun Seong, J. Shin, H. Hwang, Materials and process aspect of cross-point RRAM (invited). Microelectron. Eng. 88(7), 1113–1118 (2011)
28. Y. Levy, J. Bruck, Y. Cassuto, E.G. Friedman, A. Kolodny, E. Yaakobi, S. Kvatinsky, Logic operations in memory using a memristive akers array. Microelectron. J. 45(11), 1429–1437 (2014)
29. H. Li et al., Write disturb analyses on half-selected cells of cross-point rram arrays. In: Proceedings of the IEEE International Reliability Physics Symposium (2014), pp. MY.3.1–MY.3.4
30. S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In: Design Automation Conference (DAC) (2016), pp. 1–6. https://doi.org/10.1145/2897937.2898064
31. W. Lynch, Worst-case analysis of a resistor memory matrix. IEEE Trans. Comput. C–18(10), 940–942 (1969)
32. A. Mishchenko, ABC: a system for sequential synthesis and verification (2012), http://www.eecs.berkeley.edu/~alanmi/abc/
33. M. Oskin, F.T. Chong, T. Sherwood, Active pages: a computation model for intelligent memory. SIGARCH Comput. Archit. News 26(3), 192–203 (1998). https://doi.org/10.1145/279361.279387
34. G. Papandroulidakis, I. Vourkas, N. Vasileiadis, G.C. Sirakoulis, Boolean logic operations and computing circuits based on memristors. IEEE Trans. Circuits Syst. II Exp. Briefs 61(12), 972–976 (2014). https://doi.org/10.1109/TCSII.2014.2357351
35. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, A Case for Intelligent RAM. IEEE Micro 17(2), 34–44 (1997). https://doi.org/10.1109/40.592312
36. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, Intelligent RAM (IRAM): chips that remember and compute. In: 1997 IEEE International Solids-State Circuits Conference. Digest of Technical Papers (1997), pp. 224–225. https://doi.org/10.1109/ISSCC.1997.585348
37. J. Reuben, R. Ben-Hur, N. Wald, N. Talati, A.H. Ali, P.E. Gaillardon, S. Kvatinsky, Memristive logic: a framework for evaluation and comparison. In: International Symposium on Power and Timing Modeling, Optimization, and Simulation (PATMOS) (2017) (in press)
38. S. Shin, K. Kim, S.M. Kang, Analysis of passive memristive devices array: data-dependent statistical model and self-adaptable sense resistance for RRAMs. Proc. IEEE 100(6), 2021–2032 (2012)
39. N. Talati, S. Gupta, P. Mane, S. Kvatinsky, Logic design within memristive memories using memristor-aided loGIC (MAGIC). IEEE Trans. Nanotechnol. 15(4), 635–650 (2016). https://doi.org/10.1109/TNANO.2016.2570248
40. K. Wang, Y. Qi, J.J. Fox, M.R. Stan, K. Skadron, Association rule mining with the micron automata processor. In: 2015 IEEE International Parallel and Distributed Processing Symposium (2015), pp. 689–699. https://doi.org/10.1109/IPDPS.2015.101
41. H.S.P. Wong, H.Y. Lee, S. Yu, Y.S. Chen, Y. Wu, P.S. Chen, B. Lee, F.T. Chen, M.J. Tsai, Metal oxide RRAM. Proc. IEEE 100(6), 1951–1970 (2012). https://doi.org/10.1109/JPROC.2012.2190369
42. W. Woods, M.M.A. Taha, S.J.D. Tran, J. Bürger, C. Teuscher, Memristor panic: a survey of different device models in crossbar architectures. In: Proceedings of the 2015 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH15) (2015), pp. 106–111. https://doi.org/10.1109/NANOARCH.2015.7180595
43. L. Xie, H.A.D. Nguyen, M. Taouil, S. Hamdioui, K. Bertels, Fast boolean logic mapped on memristor crossbar. In: International Conference on Computer Design (2015), pp. 335–342. https://doi.org/10.1109/ICCD.2015.7357122
44. C.T. Yang, C.L. Huang, C.F. Lin, Hybrid cuda, openmp, and mpi parallel programming on multicore gpu clusters. Comput. Phys. Commun. 182(1), 266–269 (2011)
45. L. Yavits, S. Kvatinsky, A. Morad, R. Ginosar, Resistive associative processor. IEEE Comput. Arch. Lett. 14(2), 148–151 (2015). https://doi.org/10.1109/LCA.2014.2374597
46. Y. Zha, J. Li, Reconfigurable in-memory computing with resistive memory crossbar. In: International Conference on Computer-Aided Design (2016), pp. 1–8. https://doi.org/10.1145/2966986.2967069
47. M.A. Zidan, H.A.H. Fahmy, M.M. Hussain, K.N. Salama, Memristor-based memory: the sneak paths problem and solutions. Microelectron. J. 44(2), 176–183 (2013)
Chapter 9
Spintronic Logic-in-Memory Paradigms
and Implementations

Wang Kang, Erya Deng, Zhaohao Wang and Weisheng Zhao

Abstract In the current big data era, the limited data transfer bandwidth (memory
wall) between the processor and the memory and the increasing energy consumption
associated with data transfer (power wall) have become the most urgent problems
for the conventional von Neumann architecture, owing to the physical separation of
the processor and the memory units (see Fig. 9.1a) and the performance mismatch
between the two.

9.1 Introduction

In the current big data era, the limited data transfer bandwidth (memory wall)
between the processor and the memory and the increasing energy consumption
associated with data transfer (power wall) have become the most urgent problems
for the conventional von Neumann architecture, owing to the physical separation
of the processor and the memory units (see Fig. 9.1a) and the performance mismatch
between the two [1–3]. On one hand, workloads such as big data analytics, artificial
intelligence, and bioinformatics are growing exponentially with time; they generally
operate on large data sets, leading to frequent accesses to
the off-chip memory. On the other hand, moving data may be much more expensive
than computing itself, e.g., a DRAM access needs 200 times more energy than
a floating-point operation [4, 5]. Increasing the available bandwidth, by
increasing either the number or the frequency of the channels, is a straightforward
solution to the communication bottleneck, but it significantly increases the cost and is
not scalable [6]. Recent hardware/architecture design paradigms have moved towards
greater specialization, and specialized units for memory-centric computing are vital
to future solutions [7, 8]. The logic-in-memory (LIM) paradigm attempts
to embed computation capability into the memory and to unify data
storage and processing in the same die/chip, and thus shows great promise for breaking
the communication bottleneck of the conventional von Neumann architecture [9–17].

W. Kang (B) · E. Deng · Z. Wang · W. Zhao
Fert Beijing Research Institute, BDBC, School of Microelectronics, Beihang University, Beijing
100191, China
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
M. Suri (ed.), Applications of Emerging Memory Technology,
Springer Series in Advanced Microelectronics 63,
https://doi.org/10.1007/978-981-13-8379-3_9

The basic concept of LIM can be back to 1970s [9], and the initial idea is to add
some logic units close to the memory chips through a plane (see Fig. 9.1b) or 3D
(see Fig. 9.1c) structure, to perform operations being simple yet bandwidth-intensive
and/or latency-sensitive [6]. Strictly speaking, the initial LIM concept is more like
logic-near-memory (LNM) by reducing the data transfer distance or by adding more
memory close to the processor but without decreasing the memory access [18–21].
The premise is that being close to the memory chips, the LNM module will have much
lower latency and higher bandwidth to the memory than to the processor, thus reduc-
ing the off-chip memory bandwidth requirements as well as improving the system
performance and energy efficiency [7]. Many prior and recent works have proposed
various approaches. Based on the degree of integration between logic and memory,
they can be classified into two broad categories, i.e., LNM and LIM. Please note
that several similar terminologies are also used in different communities, such as in-
memory computing, computing-in-memory, near-memory-computing, in-memory-
processing, processing-near-memory, and processing-in-memory. Although both
LNM and LIM can alleviate the communication bottleneck, the latter fundamentally
changes the computing architecture relative to the conventional von-Neumann architecture
and brings the benefit of reducing the number of memory accesses (see Fig. 9.1d).
In this chapter, we focus on the LIM paradigm. To date, LIM research has featured
a rich design space, spanning device, circuit, and architecture innovations. However,
such promising studies have not yielded practical prototypes, owing to the incompatibility
of state-of-the-art logic and memory technologies in terms of design complexity
and fabrication cost [7]. The emergence of 3D integration technology and
nonvolatile memories (NVMs) provides alternative possibilities for effectively and
efficiently implementing LIM hardware [2, 5, 7, 8, 13–15, 21–25]. On one hand,
the 3D-stacking capability of the NVM devices allows the logic and
memory circuits to be decoupled into different manufacturing processes by using the back-end-of-line
(BEOL) process, thereby alleviating the fabrication complexity as well as cost. On
the other hand, the resistance-based storage mechanisms of the NVM devices provide
inherent logic functions, thus enabling energy-efficient logic computing
capability to be embedded within the memory [5].
Recently, a lot of studies have demonstrated that NVMs, such as resistive random-
access memory (ReRAM) [23, 24], magnetic RAM (MRAM) [25, 26], and phase
change memory (PCM) [5, 27], are qualified for performing logic operations beyond
data storage. The NVM devices act as both logic and memory units in the same die, thus
promising a radical renovation of the relationship between computation and mem-
ory. The NVM-based LIM architectures exploit either the peripheral circuitry (e.g.,
sensing circuits) or the memory cells already existing inside the memory die (or with
minimal changes) rather than adding new logic units in the memory chip to perform
computing tasks. For example, ReRAM can perform matrix-vector multiplication
efficiently in a crossbar structure. It has been widely studied to represent multistate
synapses in neural computation [24, 28]. On the other hand, Boolean logic opera-
tions in ReRAM, MRAM, and PCM have also been widely studied, e.g., through
9 Spintronic Logic-in-Memory Paradigms and Implementations 217

[Fig. 9.1 panel labels: (a) von-Neumann architecture (CPU cores, cache, memory channel, memory); (b) logic-near-memory (LNM), planar (CPU cores, cache, memory channel, logic unit, memory); (c) logic-near-memory (LNM), 3D stacking (CPU cores, logic unit, memory); (d) logic-in-memory (LIM) (external CPU, reconfigurable LIM cores, memory blocks)]

Fig. 9.1 Possible evolution of the computing architecture; a conventional von-Neumann architec-
ture with a separated processor (central processing unit, CPU) and memory; b, c the logic-near-
memory (LNM) architecture with plane and 3D implementations by adding a small amount of logic
units close to the memory or by adding more memory close to the processor; d the logic-in-memory
(LIM) architecture attempts to embed computation capability into the memory, and to realize the
unity of data storage and processing at the smallest grain in the same die [8]

material implication (IMP) or sequential iteration, by exploiting the conditional
toggling (switching) property of the resistive devices [23, 29, 30]. Nevertheless, most of
the proposed schemes can only perform some application-specific logic functions.
An approach that supports a complete set of logic functions is preferable for a general-purpose
LIM design. In addition, compared with data storage tasks, logic tasks generally place
more stringent requirements on the device/memory in terms of dynamic energy, switching
speed and endurance. Spintronic memories have the most potential
for LIM implementations when all these performance
requirements are taken into consideration. In this chapter, we will focus on the spintronic LIM paradigms and
introduce three spintronic LIM approaches and implementations.
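
The crossbar matrix-vector multiplication mentioned above for ReRAM can be illustrated with a minimal sketch. It assumes an idealized crossbar (no wire resistance, no sneak paths, columns held at virtual ground); the function name and all numerical values are hypothetical and are not taken from the cited works.

```python
# Idealized crossbar: cell (i, j) has conductance G[i][j] and links row i to column j.
# Row voltages V[i] are applied and every column is held at virtual ground, so the
# column current is the sum of V[i] * G[i][j] (Ohm's law per cell, Kirchhoff's
# current law per column).

def crossbar_mvm(conductances, voltages):
    """Return the column currents I[j] = sum_i V[i] * G[i][j]."""
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need one input voltage per row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]

# Hypothetical values: conductances in siemens, voltages in volts.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
V = [0.1, 0.2, 0.3]
I = crossbar_mvm(G, V)  # column currents in amperes
```

In this picture the stored conductances play the role of the matrix and the applied voltages the role of the vector, so the multiply-accumulate happens where the data resides.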

9.2 Spintronic LIM Using Hybrid Spintronic/CMOS Circuitry

This LIM approach exploits the 3D integration capability of spintronic devices (mainly
magnetic tunnel junctions, MTJs [31, 32]) to reduce the global routing and the data
transfer distance between the memory and the logic units, as shown in Fig. 9.2. More
importantly, by embedding the nonvolatile spintronic devices, the temporarily unused
blocks can be completely powered off during the idle state while retaining their
data, thus saving standby power. Moreover, data can be instantaneously recovered;
this approach is therefore suitable for “instant-on” and “normally-off” systems.
In addition, the area can be largely reduced since the spintronic devices are
fabricated on top of the CMOS circuits and do not occupy extra area [33].
Figure 9.3a illustrates the schematic of the hybrid spintronic/CMOS-based LIM
architecture [34, 35], which is composed of three main parts: (a) a current-
Fig. 9.2 3D spintronic/CMOS LIM architecture which integrates non-volatility into the logic circuits. Layers labeled in the figure: MTJ (spintronic memory); CMOS/metal (logic); Si

mode sense amplifier to detect the currents of the two branches, and then to evaluate
the logic output result; (b) a writing block to program the data stored in the spintronic
memory cells; (c) a CMOS logic network (LN) that performs the logic computation.
The LN contains MTJ devices for the nonvolatile inputs and a CMOS logic tree for the volatile
inputs in order to remain area- and power-efficient. In this case, the volatile
logic data can be driven at a high processing frequency, in contrast to the nonvolatile
data stored in the spintronic memory cells, which should be changed at a relatively
low frequency, i.e., they are quasi-constant for computing. The CMOS transistors
and MTJs are the main components of the LN, as shown in Fig. 9.3b [33, 36].
• The CMOS transistor is used as a variable resistor, whose resistance is controlled by
an external volatile input voltage ($X$) applied to the gate (G) terminal. If $X$ = ‘1’,
the CMOS transistor conducts with a low resistance ($R_{ON} \sim \mathrm{k}\Omega$). Otherwise,
the CMOS transistor is blocked and has a high resistance ($R_{OFF} \sim \mathrm{G}\Omega$).
• The MTJ device is used not only as a storage element but also as a logic input
operand. It has a low resistance ($R_P$) and stores a logic data ‘1’ ($Y$ = ‘1’) when it
is in the parallel state; otherwise, if the MTJ is in the antiparallel state, its resistance
becomes high ($R_{AP}$) and it stores a logic data ‘0’ ($Y$ = ‘0’). The difference
between the two resistances depends on the tunneling magnetoresistance (TMR)
ratio.
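
The two LN components listed above can be modeled as input-controlled resistors. The following is only a rough sketch: the resistance values, the read voltage, and the single series branch (one transistor plus one MTJ) are assumptions made for illustration and do not reproduce the circuits of the cited works. The TMR ratio is taken in its usual form, $(R_{AP} - R_P)/R_P$.

```python
# Hypothetical device values, chosen only for illustration.
R_ON = 1e3    # conducting transistor, ohms (kilo-ohm range)
R_OFF = 1e9   # blocked transistor, ohms (giga-ohm range)
R_P = 5e3     # MTJ parallel-state resistance, ohms (assumed)
TMR = 1.0     # TMR ratio (R_AP - R_P) / R_P, i.e. 100 % (assumed)
R_AP = R_P * (1.0 + TMR)  # antiparallel-state resistance

def transistor_resistance(x):
    """Volatile input X at the gate: X = 1 conducts, X = 0 blocks."""
    return R_ON if x == 1 else R_OFF

def mtj_resistance(y):
    """Stored input Y: '1' is the parallel (low-resistance) state."""
    return R_P if y == 1 else R_AP

def branch_current(x, y, v_read=0.1):
    """Current through one transistor in series with one MTJ (assumed topology)."""
    return v_read / (transistor_resistance(x) + mtj_resistance(y))

# Print the branch current for every input combination.
for x in (0, 1):
    for y in (0, 1):
        print(f"X={x} Y={y} I={branch_current(x, y):.3e} A")
```

With these numbers a blocked transistor dominates the branch resistance, while for a conducting transistor the MTJ state sets the current; this current difference is what a sense amplifier can resolve.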
The reading current ($I_L$ or $I_R$) is inversely proportional to the total resistance
($R_L$ or $R_R$) of the left or the right branch in the LN. Two complementary outputs
$z$ and $z'$, corresponding to the two opposite logic values, are determined by the reading
currents, providing differential logic operations. If the current of the left branch is
larger than that of the right branch ($I_L > I_R$), the output results on nodes $z$ and $z'$ are
‘1’ and ‘0’, respectively; otherwise, if $I_L < I_R$, the corresponding output results are
$z$ = ‘0’ and $z'$ = ‘1’. By configuring the LN, different nonvolatile
logic functions can be realized, such as OR/NOR, AND/NAND, XOR/XNOR, look-up
table, flip-flop, and full adder. More details can be found in [37–42]. Figure 9.4 shows the
LN configurations for different logic operations that are proposed and analyzed in
[Fig. 9.3 labels: (a) Z = F(X, Y), with complementary external inputs X = {x1, x1', x2, x2', ..., xi, xi'}, complementary stored inputs Y = {y1, y1', y2, y2', ..., yj, yj'} and complementary outputs Z = {z1, z1', z2, z2', ..., zn, zn'}, where xi, yj, zn ∈ {0, 1}. Blocks: sense amplifier (SA) with branch currents I_L and I_R and outputs z and z' (z = 1, z' = 0 if I_L > I_R; z = 0, z' = 1 if I_L < I_R); logic network (LN) comprising the CMOS logic tree (volatile logic data) and the MTJs (non-volatile logic data); writing circuit (I_W). (b) CMOS transistor: R_T = R_ON if x = 1, R_OFF if x = 0; MTJ: R_MTJ = R_P if y = 1, R_AP if y = 0]
Fig. 9.3 a Schematic of the hybrid spintronic/CMOS-based LIM architecture; b components in
the logic network (LN) [36]

[40]. Figure 9.5 shows an example of a 1-bit full adder [41] based on the above spintronic
LIM paradigm. The CMOS logic tree of the full adder is designed according to
(9.1)–(9.4), where $A$ (/A: the complement of $A$) and $C_i$ (/Ci: the complement of $C_i$)
are the volatile input operands while $B$ (/B) is the nonvolatile input operand
stored in the MTJs [33].

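$$\mathrm{SUM} = A \oplus B \oplus C_i = ABC_i + A\bar{B}\bar{C}_i + \bar{A}B\bar{C}_i + \bar{A}\bar{B}C_i \tag{9.1}$$

$$\overline{\mathrm{SUM}} = \bar{A}\bar{B}\bar{C}_i + \bar{A}BC_i + A\bar{B}C_i + AB\bar{C}_i \tag{9.2}$$

$$C_o = AB + AC_i + BC_i \tag{9.3}$$

$$\overline{C_o} = \bar{A}\bar{B} + \bar{A}\bar{C}_i + \bar{B}\bar{C}_i \tag{9.4}$$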
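
As a quick check of (9.1)–(9.4), the short sketch below evaluates the four sum-of-products expressions with the complemented literals written out explicitly and compares them with ordinary binary addition. The function name is hypothetical, and the code only verifies the Boolean expressions; it does not model the circuit of Fig. 9.5.

```python
from itertools import product

def full_adder(a, b, ci):
    """Evaluate (9.1)-(9.4) as sums of products; inputs and outputs are 0 or 1."""
    na, nb, nci = 1 - a, 1 - b, 1 - ci  # complemented literals
    s = (a & b & ci) | (a & nb & nci) | (na & b & nci) | (na & nb & ci)    # (9.1)
    ns = (na & nb & nci) | (na & b & ci) | (a & nb & ci) | (a & b & nci)  # (9.2)
    co = (a & b) | (a & ci) | (b & ci)                                     # (9.3)
    nco = (na & nb) | (na & nci) | (nb & nci)                              # (9.4)
    return s, ns, co, nco

for a, b, ci in product((0, 1), repeat=3):
    s, ns, co, nco = full_adder(a, b, ci)
    assert s == a ^ b ^ ci and ns == 1 - s              # SUM and its complement
    assert 2 * co + s == a + b + ci and nco == 1 - co   # carry and its complement
```

The complementary pairs (SUM, /SUM) and (Co, /Co) are what a differential sense amplifier would produce on its two output nodes.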

By integrating the spintronic devices directly into the logic circuits, power sup-
ply can be cut off during the standby mode. Therefore, the hybrid spintronic/CMOS
based LIM architecture could provide a way to realize ultra-low power consump-
tion and high-performance computing capability for the next generation processor.
Moreover, some computing system paradigms, such as brain-inspired computing, are
[Fig. 9.4 panels: (a) AND, (b) OR, (c) XOR, each with inputs A and B and output Q. Labels in each logic network: inputs A, /A, B, /B; branch currents I_L and I_R; outputs Qm and /Qm; left branch LB and right branch RB]

Fig. 9.4 Structure of the logic network (LN) for nonvolatile a AND logic gate b OR logic gate
c XOR logic circuit. “LB” and “RB” represent the left and right branches, respectively [40]

[Fig. 9.5 schematic labels: PCSA (supply Vdd, clock CLK) with outputs SUM, /SUM, Co and /Co; CMOS logic tree driven by A, /A, Ci and /Ci; MTJs storing B and /B; writing circuit (supply Vdda, Gnd); SUM sub-circuit and CARRY sub-circuit]

Truth table

A  B  Ci  SUM  Co
0  0  0   0    0
0  0  1   1    0
0  1  0   1    0
0  1  1   0    1
1  0  0   1    0
1  0  1   0    1
1  1  0   0    1
1  1  1   1    1

Fig. 9.5 Full schematic and truth table of the 1-bit full adder based on the hybrid spintronic/CMOS
LIM architecture [41]

expected to be realized by using the spintronic/CMOS-based architecture [43]. This
hybrid LIM structure can also be directly extended to other resistive memory devices,
such as domain-wall racetrack memory, ReRAM, and PCM, by simply replacing the
spintronic devices (MTJs) with other resistive memory devices. Despite the advantages
described above, this LIM approach also faces several challenges that
should be properly addressed. For example, the switching latency (several ns) of the
nonvolatile device is much larger than that of conventional CMOS transistors,
resulting in a relatively lower computing frequency. Another challenge is the reliability issue
mainly caused by the device mismatch (both CMOS and nonvolatile devices) of the
sensing circuit. Unlike the memory chips where complex error correction circuits
(ECCs) can be employed, it is difficult to embed ECCs in the logic circuits while
keeping high speed, high power efficiency, and low area simultaneously. Therefore,
alternative high-reliability solutions should be developed for this approach. Current
research efforts on this topic include fast-access and high-TMR MTJ
development, high-performance sensing circuit design, and low-cost and reliable
integration processes [2].

9.3 Spintronic LIM Using Peripheral Circuitry

In this approach, the core memory cell array is exactly the same as a standard memory,
thus the storage density and energy efficiency of the regular read and write operations
can be maintained. The basic concept to perform logic computation is to exploit
the peripheral circuitry (e.g., read circuit) for performing a range of bulk bit-wise
arithmetic operations [44–48]. Figure 9.6 shows the circuit schematic of a typical
spin transfer torque magnetic random-access memory (STT-MRAM) bank and the
related 1T1MTJ bit-cell and the peripheral sense circuit. Here 1T1MTJ refers to one
CMOS transistor connected with one MTJ device in series. The STT-MRAM bank
is generally organized with an array of 1T1MTJ bit-cells via a number of bit-lines
(BLs), source-lines (SLs), word-lines (WLs) and peripheral circuits, e.g., write/read
drivers, row/column decoders, and input/output (I/O) interfaces.

[Fig. 9.6 labels: STT-MRAM bit-cell array of 1T1MTJ bit-cells (MTJ and transistor connected to bit-line, word-line and source-line); row decoder and wordline driver (row address); column decoder and write driver (column address); sense amplifiers (VDD, Ctrl, transistors P0–P3 and N0–N1, outputs Out and Out_bar); bank I/O with command/address (Cmd/Addr) and data]

Fig. 9.6 Schematic of a STT-MRAM bank and the associated 1T1MTJ bit-cell structure and sense
amplifier [44]
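[Fig. 9.7 labels: in both panels a sense amplifier (SA) compares Vdata (data cells on BL/SL) with Vref (reference cells on RL/SL). (a) Read: reference at 0.5(V_P + V_AP), between V_P and V_AP. (b) LIM: reference at 0.5(V_P,P + V_AP,P) for OR and at 0.5(V_AP,P + V_AP,AP) for AND, among the levels V_P,P, V_AP,P and V_AP,AP]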

Fig. 9.7 The key concept of different reference selections to perform a memory read and b LIM
operations [46]

Different reference thresholds can be chosen to perform memory read and LIM
operations [46]. As shown in Fig. 9.7a, for a memory read operation, an addressed
memory cell is selected by the target BL, WL, and SL, and is embedded in the read
path to generate a data sense voltage ($V_{data}$), which is compared with a reference
voltage $V_{ref}$ through a sense amplifier. Depending on the state of the
selected bit-cell (parallel or antiparallel, corresponding to low or high resistance,
$R_P$ or $R_{AP}$), $V_{data}$ is $V_P$ or $V_{AP}$ ($V_P < V_{AP}$), respectively. Thus, by setting
the reference voltage at $(V_P + V_{AP})/2$, the sense amplifier outputs a binary bit ‘1’
or bit ‘0’ when $V_{data} > V_{ref}$ or $V_{data} < V_{ref}$, respectively. For comparison, Fig. 9.7b depicts the
sensing-based LIM operations (with two input operands as an example) using the
peripheral read circuit, where two memory bit-cells are addressed simultaneously.
Owing to the different resistance combinations of the two selected bit-cells, i.e.,
($R_{AP}$, $R_{AP}$), ($R_{AP}$, $R_P$), and ($R_P$, $R_P$), three different data sense voltages $V_{data}$,
denoted as $V_{AP,AP}$, $V_{AP,P}$, and $V_{P,P}$, respectively, can be generated. If
the reference voltage $V_{ref}$ is set to $(V_{AP,AP} + V_{AP,P})/2$ by tuning the reference
resistance $R_{ref}$, the sense amplifier outputs binary ‘1’ only when both selected
bit-cells are in the antiparallel state, i.e., $V_{data} > V_{ref}$. Thus, this sensing operation
with a modified reference voltage performs an AND/NAND logic operation, taking the
binary data stored in the two bit-cells as the two logic input operands. Similarly, when
the reference voltage is shifted to $(V_{P,P} + V_{AP,P})/2$, the OR/NOR logic operation
can be performed. More details can be found in [45, 46]. An XOR logic
operation can also be realized when the two sensing schemes shown in Fig. 9.7 are
used in conjunction with a CMOS-based NOR logic gate or by modifying the sensing
circuit [45]. Furthermore, a full adder and other more complex logic functions can
be achieved by a combination of the above-described operations [45]. This approach
can be extended to the case with multiple input operands by tuning the corresponding
references. In summary, by tuning the reference resistances, STT-MRAM can
perform reconfigurable Boolean logic operations with regular memory-like read
operations by utilizing the peripheral read circuit. Recent studies have extended the
above concept to complementary STT-MRAM [44], spin–orbit torque MRAM [47]
and domain-wall memory [14]. It is worth noting that for data stored in different
banks/blocks, local in-memory data transfer is still required. This approach is rather
efficient for bulk bit-wise Boolean logic operations that do not need frequent data
updates. Recent studies have applied this approach to some data-intensive applications,
such as image edge detection [46], data encryption [14] and neural networks
[48].
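
The reference-selection idea of this section can be sketched in a few lines. The model below is an assumption made for illustration, not the circuit of [45, 46]: the two selected cells are treated as resistors in parallel driven by a fixed sensing current, and all numerical values and function names are hypothetical. Following the text, the antiparallel (high-resistance) state gives the higher sense voltage and is read as ‘1’.

```python
R_P, R_AP = 3e3, 6e3   # hypothetical MTJ resistances, ohms
I_SENSE = 20e-6        # hypothetical sensing current, amperes

def resistance(bit):
    # Convention of this sketch: antiparallel (high resistance) stores '1'.
    return R_AP if bit else R_P

def v_data(a, b):
    """Sense voltage with two cells selected in parallel (assumed read path)."""
    ra, rb = resistance(a), resistance(b)
    return I_SENSE * (ra * rb) / (ra + rb)

# The three possible sense levels and the two midpoint references.
V_PP, V_APP, V_APAP = v_data(0, 0), v_data(1, 0), v_data(1, 1)
V_REF_OR = 0.5 * (V_PP + V_APP)
V_REF_AND = 0.5 * (V_APP + V_APAP)

def sense(v, v_ref):
    """Sense amplifier: output 1 when the data voltage exceeds the reference."""
    return 1 if v > v_ref else 0

def lim_or(a, b):
    return sense(v_data(a, b), V_REF_OR)

def lim_and(a, b):
    return sense(v_data(a, b), V_REF_AND)

def lim_xor(a, b):
    # One possible composition: a NOR gate fed with the AND output and the
    # complementary (NOR) output of the OR sensing.
    nor_out = 1 - lim_or(a, b)
    return 1 - (lim_and(a, b) | nor_out)
```

Only the reference changes between the OR and AND operations; the cell array and the read path stay the same, which is the point of this approach.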

9.4 Spintronic LIM Using Memory Cells

This approach exploits the memory cells for logic operations. The key idea is to
dynamically configure the memory cell states with regular memory-like write
and read operations, depending on the combination of the logic input operands. The
initial data stored in the memory cell acts as one of the input operands, and the logic
output is represented by the final resistance state of the memory cell, which is stored in situ
in the same memory cell through a regular memory-like write operation and
can be output via the sense amplifier in a regular memory-like readout manner [49,
50]. Below we take an advanced spintronic memory, based on three-terminal
voltage-gated spin Hall effect (VG-SHE) MTJ devices [51, 52], to
describe the LIM concept and implementation.
Figure 9.8 shows the schematic and switching behavior of the VG-SHE-MTJ
device, which exploits both the SHE [53, 54] and voltage-controlled magnetic
anisotropy (VCMA) effect [55, 56] for MTJ switching. For the SHE-driven MTJ switching
mechanism, the critical current can be modulated by applying a bias voltage across
the MTJ via the VCMA mechanism.
The key idea for logic computation is to modulate the final resistance state (denoted
as the stateful logic output result) of the MTJ device with two different inputs (i.e.,
the VCMA bias voltage and the SHE write current). Without loss of generality, we
can assume that a high resistance (low resistance) state of the MTJ represents a
logical data ‘1’ (data ‘0’). Furthermore, we can assume that the first input data ($A$) is
denoted by the VCMA bias voltage ($V_b$). Specifically, a positive VCMA bias voltage
(with amplitude $+V_b$ = 600 mV) denotes the logical input “$A$ = 1” while a zero
VCMA bias voltage denotes the logical input “$A$ = 0”. The second input data ($B$)
is denoted by the initial data value (i.e., resistance) stored in the MTJ device. The
third input ($C$) is denoted by the polarity of the SHE write current ($I_{SHE}$). A positive
SHE write current ($+I_{SHE}$) denotes the logical input “$C$ = 1” while a negative SHE
write current ($-I_{SHE}$) denotes the logical input “$C$ = 0”. Here, we need
$|I_{C1}| < |I_{SHE}| < |I_{C2}|$, e.g., $|I_{SHE}|$ = 65 µA, for correct logic computation. In this
configuration, if $A$ = 1 ($V_b$ = +600 mV), the critical current for SHE-driven MTJ
magnetization switching is $|I_{C1}|$, so $I_{SHE}$ can switch the MTJ state and the final MTJ
[Fig. 9.8 labels: (a) MTJ stack (pinned layer, oxide barrier, free layer) on a heavy metal strip, with charge current I_SHE, spin current, V_MTJ and axes x, y, z; (b) normalized resistance versus current (µA) for bias voltages Vb1 and Vb2, with thresholds ±I_C1, ±I_SHE, ±I_C2 and levels logic “1” and logic “0”; (c) energy barriers E_b(V_b1) and E_b(V_b2) between the “P” and “AP” states for Vb0, Vb1, Vb2; (d) m_Z versus current (µA, −100 to 100) for Vb1 = 400 mV and Vb2 = 600 mV; (e) critical current (µA, 20 to 160) versus voltage applied across the MTJ, Vb (mV, 100 to 600), annotated “Critical current is linearly proportional to the bias voltage Vb across the MTJ”]

Fig. 9.8 Three-terminal VG-SHE-driven MTJ device. a Device schematic; b voltage-gating mech-
anism on the critical current for SHE-driven magnetization switching under different bias voltages;
c, d illustration of the energy barrier and the corresponding magnetization switching under two
different bias voltages; e the critical SHE switching current as a function of the applied bias voltage
across the MTJ device [49]

state depends on the polarity of $I_{SHE}$; otherwise, if $A$ = 0 ($V_b$ = 0 mV), the critical
current is $|I_{C2}|$, so $I_{SHE}$ cannot switch the state of the MTJ and the MTJ retains its
initial data. Based on the above configurations, we can realize a stateful Boolean
logic function with a single VG-SHE-driven MTJ, expressed as

$$B_{i+1} = AC + \bar{A}B_i \tag{9.5}$$



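Fig. 9.9 Stateful reconfigurable logic via a single VG-SHE-driven MTJ device; a state transition diagram; b truth table; c, d Karnaugh map [49]

(a) State transitions: B = 0 switches to B = 1 for (A, C) = (1, 1); B = 1 switches to B = 0 for (A, C) = (1, 0); B = 1 is retained for (A, C) = (0, 0), (0, 1), (1, 1); B = 0 is retained for (A, C) = (0, 0), (0, 1), (1, 0).

(b) Truth table

Bi  A  C  Bi+1
0   0  0  0
0   1  0  0
1   0  0  1
1   1  0  0
0   0  1  0
0   1  1  1
1   0  1  1
1   1  1  1

(c) Karnaugh map of Bi+1

       A = 0   A = 1
C = 0  Bi      0
C = 1  Bi      1

(d) $B_{i+1} = AC + \bar{A}B_i$; for $C = 0$, $B_{i+1} = \bar{A}B_i$ (‘AND’); for $C = 1$, $B_{i+1} = A + B_i$ (‘OR’); for $C = \bar{B}_i$, $B_{i+1} = A \oplus B_i$ (‘XOR’)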

where $B_i$ and $B_{i+1}$ are the initial input data and the final logical output result stored in situ
in the MTJ, respectively. The input $C$ acts as a control signal. Specifically,
if $C$ = 1, then $B_{i+1} = A + B_i$, performing an “OR” logic operation; otherwise,
if $C$ = 0, then $B_{i+1} = \bar{A}B_i$, performing an “AND” logic operation (with the input $A$
complemented). For the “XOR” logic operation, we can first read out $B_i$ from the MTJ
and set $C = \bar{B}_i$; then we get $B_{i+1} = A\bar{B}_i + \bar{A}B_i$, performing an “XOR” logic operation. Figure 9.9
shows the state transition diagram, truth table and Karnaugh map. It should be noted
that all Boolean logic functions can be realized by reconfiguring the input signals.
The logic output $B_{i+1}$ is stored in situ in the MTJ, and one additional memory-like
read operation is needed to read out the logic output. It can be seen that the logic
operations in the VG-SHE-driven MTJ based spintronic memory are very similar
to the regular write/read operations of a memory data access. This LIM approach
can be implemented either in a typical 2T1MTJ cell array or in a crossbar array
structure owing to the shared path of the SHE write current. More details can be found
in [49]. A similar LIM concept can also be extended to STT-MRAM by changing the
bit-cell structure [57]. In this LIM approach, the memory can work in either the
memory mode or the logic mode, as shown in Fig. 9.10, depending on the application-oriented
requirements. This approach is applicable to any resistive memories with
Fig. 9.10 Illustration of the reconfigurable LIM architecture, in which the LIM core can be reconfigured between logic mode and memory mode. Labels in the figure: external processor (CPU); LIM cores; reconfigurable; logic mode; memory mode

similar device behaviors. Nevertheless, architecture/software support is required
to facilitate this approach in practical applications.
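
The stateful logic of a single VG-SHE-driven MTJ can be checked with a small behavioral sketch. It assumes an ideal threshold device: the MTJ is written whenever $|I_{SHE}|$ exceeds the critical current selected by the bias voltage, and keeps its data otherwise. The two critical-current values and the function names are hypothetical; only $|I_{SHE}|$ = 65 µA and the ordering $|I_{C1}| < |I_{SHE}| < |I_{C2}|$ come from the text.

```python
I_C1 = 40e-6    # critical current with Vb = 600 mV applied (hypothetical value)
I_C2 = 160e-6   # critical current with Vb = 0 (hypothetical value)
I_SHE = 65e-6   # write-current magnitude quoted in the text

def next_state(b, a, c):
    """Return B_{i+1} for stored bit b, VCMA input a and current-polarity input c."""
    critical = I_C1 if a == 1 else I_C2  # A = 1 lowers the switching threshold
    if I_SHE > critical:
        return c  # the MTJ is written; the polarity of I_SHE sets the new state
    return b      # the current is too small; the MTJ keeps its data

def lim_or(a, b):
    return next_state(b, a, 1)       # C = 1 gives A OR B_i

def lim_and(a, b):
    return next_state(b, a, 0)       # C = 0 gives (NOT A) AND B_i, as in the Fig. 9.9 truth table

def lim_xor(a, b):
    return next_state(b, a, 1 - b)   # B_i is read first, then C is set to its complement

# The threshold model reproduces B_{i+1} = A*C + (NOT A)*B_i for all inputs.
for b in (0, 1):
    for a in (0, 1):
        for c in (0, 1):
            assert next_state(b, a, c) == (a & c) | ((1 - a) & b)
```

Each operation is a single write-like step on the cell; reading the result afterwards takes one ordinary read.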

9.5 Summaries and Perspectives

In this chapter, we have briefly introduced three LIM paradigms in spintronic memories.
By exploiting the intrinsic features of spintronic memories (3D integration capability,
peripheral circuitry, or switching behaviors), different LIM approaches can
be employed. Although LIM has the potential to address the memory wall or power wall
bottleneck of the von-Neumann architecture through the unity of storage and
computation in the same die, more efforts, in particular architecture/software support,
are strongly required to facilitate this approach in practical applications.

Acknowledgements This work was supported by the National Natural Science Founda-
tion of China (61871008 and 61571023), the National Key Technology Program of China
(2017ZX01032101), and the International Mobility Project (B16001 and 2015DFE12880).
References

1. N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V.
Narayanan, Leakage current: Moore’s law meets static power. Computer 36(12), 68–75 (2003)
2. W. Kang, Y. Zhang, Z. Wang, J. Klein, C. Chappert, D. Ravelosona, G. Wang, Y. Zhang, W.
Zhao, Spintronics: emerging ultra-low power circuits and systems beyond MOS technology.
ACM J. Emerg. Technol. Comput. Syst. 12(2), 1–42 (2015)
3. W.A. Wulf, S.A. McKee, Hitting the memory wall: implications of the obvious. ACM
SIGARCH Comput. Arch. News 23(1), 20–24 (1995)
4. S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel
computing. IEEE Micro 31(5), 7–17 (2011)
5. S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, Y. Xie, Pinatubo: a processing-in-memory architecture
for bulk bitwise operations in emerging non-volatile memories, in ACM/EDAC/IEEE Design
Automation Conference (2016), pp. 1–6
6. V. Seshadri, O. Mutlu, The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Ini-
tialization, Bitwise AND and OR, arXiv:1610.09603 (2016)
7. Z. Chowdhury, J.D. Harms, S.K. Khatamifard, M. Zabihi, Y. Lv, A.P. Lyle, S. Sapatnekar,
U.R. Karpuzcu, J.-P. Wang, Efficient in-memory processing using spintronics. IEEE Comput.
Archit. Lett. 17(1), 42–46 (2018)
8. M.A. Zidan, J.P. Strachan, W.D. Lu, The future of electronics based on memristive systems.
Nat. Electron. 1(1), 22–29 (2018)
9. H.S. Stone, A logic-in-memory computer. IEEE Trans. Comput. C-19(1), 73–78 (1970)
10. J. Ahn, S. Yoo, O. Mutlu, K. Choi, PIM-enabled instructions: a low-overhead, locality-aware
processing-in-memory architecture, in 2015 ACM/IEEE 42nd Annual International Symposium
on Computer Architecture (2015), pp. 336–348
11. J. Ahn, S. Hong, S. Yoo, O. Mutlu, K. Choi, A scalable processing-in-memory accelerator
for parallel graph processing, in 2015 ACM/IEEE 42nd Annual International Symposium on
Computer Architecture (2015), pp. 105–117
12. D.G. Elliott, M. Stumm, W.M. Snelgrove, C. Cojocaru, R. McKenzie, Computational RAM:
implementing processors in memory. IEEE Des. Test Comput. 16(1), 32–41 (1999)
13. W. Kang, Z. Wang, Y. Zhang, J.O. Klein, W. Lv, W. Zhao, Spintronic logic design methodology
based on spin Hall effect-driven magnetic tunnel junctions. J. Phys. D Appl. Phys. 49(6), 065008
(2016)
14. D. Fan, S. Angizi, Z. He, In-memory computing with spintronic devices, in 2017 IEEE Com-
puter Society Annual Symposium on VLSI (2017), pp. 683–688
15. W. Kang, C. Zheng, Y. Zhang, D. Ravelosona, W. Lv, W. Zhao, Complementary spintronic
logic with spin Hall effect-driven magnetic tunnel junction. IEEE Trans. Magn. 51(11), 1–4
(2015)
16. P.E. Gaillardon, L. Amaru, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, G.D. Micheli,
The programmable logic-in-memory (PLiM) computer, in IEEE Design, Automation and Test
in Europe Conference and Exhibition (2016), pp. 427–432
17. R. Nair, S.F. Antao, C. Bertolli, P. Bose, J.R. Brunheroto, T. Chen, C.-Y. Cher, C.H.A. Costa,
J. Doi, C. Evangelinos, B.M. Fleischer, T.W. Fox, D.S. Gallo, L. Grinberg, J.A. Gunnels, A.C.
Jacob, P. Jacob, H.M. Jacobson, T. Karkhanis, C. Kim, J.H. Moreno, J.K. O’Brien, M. Ohmacht,
Y. Park, D.A. Prener, B.S. Rosenburg, K.D. Ryu, O. Sallenave, M.J. Serrano, P.D.M. Siegl, K.
Sugavanam, Z. Sura, Active memory cube: a processing-in-memory architecture for exascale
systems. IBM J. Res. Dev. 59(2/3), 17:1–17:14 (2015)
18. M. Gao, G. Ayers, C. Kozyrakis, Practical near-data processing for in-memory analytics frame-
works, in 2015 International Conference on Parallel Architecture and Compilation (2015),
pp. 113–124
19. K. Chen, S. Li, N. Muralimanohar, J.H. Ahn, J.B. Brockman, N.P. Jouppi, Cacti-3dd:
architecture-level modeling for 3d die-stacked dram main memory, in IEEE Design, Automa-
tion and Test in Europe Conference and Exhibition (2012), pp. 33–38
20. A.F. Farahani, J.H. Ahn, K. Morrow, N.S. Kim, NDA: Near-DRAM acceleration architecture
leveraging commodity DRAM devices and standard memory modules, in 2015 IEEE 21st
International Symposium on High Performance Computer Architecture (2015), pp. 283–295
21. H.-S. Philip Wong, S. Salahuddin, Memory leads the way to better computing. Nat. Nanotech-
nol. 10(3), 191–194 (2015)
22. A. Chen, A review of emerging non-volatile memory (NVM) technologies and applications.
Solid-State Electron. 125, 25–38 (2016)
23. J. Borghetti, G.S. Snider, P.J. Kuekes, J.J. Yang, D.R. Stewart, R.S. Williams, Memristive
switches enable stateful logic operations via material implication. Nature 464(7290), 873–876
(2010)
24. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: a novel processing-in-
memory architecture for neural network computation in ReRAM-based main memory. ACM
SIGARCH Comput. Arch. News 44(3), 27–39 (2016)
25. L. Wang, W. Kang, F. Ebrahimi, X. Li, Y. Huang, C. Zhao, K.L. Wang, W. Zhao, Voltage-
controlled magnetic tunnel junctions for processing-in-memory implementation. IEEE Electron
Device Lett. 39(3), 440–443 (2018)
26. N. Locatelli, V. Cros, J. Grollier, Spin-torque building blocks. Nat. Mater. 13(1), 11–20 (2014)
27. H. Zhang, G. Chen, B.C. Ooi, K.-L. Tan, M. Zhang, In-memory big data management and
processing: a survey. IEEE Trans. Knowl. Data Eng. 27(7), 1920–1948 (2015)
28. Z. Wang, S. Joshi, S. Savel’ev, W. Song, R. Midya, Y. Li, M. Rao, P. Yan, S. Asapu, Y. Zhuo,
H. Jiang, P. Lin, C. Li, J.H. Yoon, N.K. Upadhyay, J. Zhang, M. Hu, J.P. Strachan, M. Barnell,
Q. Wu, H. Wu, R.S. Williams, Q. Xia, J.J. Yang, Fully memristive neural networks for pattern
classification with unsupervised learning. Nat. Electron. 1(2), 137–145 (2018)
29. E. Linn, R. Rosezin, S. Tappertzhofen, U. Bottger, R. Waser, Beyond von Neumann—logic
operations in passive crossbar arrays alongside memory operations. Nanotechnology 23(30),
305205 (2012)
30. S. Gao, G. Yang, B. Cui, S. Wang, F. Zeng, C. Song, F. Pan, Realisation of all 16 Boolean logic
functions in a single magnetoresistance memory cell. Nanoscale 8(25), 12819–12825 (2016)
31. W. Zhao, E. Belhaire, C. Chappert, P. Mazoyer, Spin transfer torque (STT)-MRAM-based
runtime reconfiguration FPGA circuit. ACM Trans. Embed. Comput. Syst. 9(2), 14:1–14:16
(2009)
32. C.J. Lin, S.H. Kang, Y.J. Wang, K. Lee, X. Zhu, W.C. Chen, X. Li, W.N. Hsu, Y.C. Kao, M.T.
Liu, W.C. Chen, Y. Lin, M. Nowak, N. Yu, L. Tran, 45 nm low power CMOS logic compatible
embedded STT MRAM utilizing a reverse-connection 1T/1MTJ Cell, in IEEE International
Electron Devices Meeting (2009), pp. 1–4
33. E. Deng, Design and development of low-power and reliable logic circuits based on spin-
transfer torque magnetic tunnel junctions, Ph.D. dissertation, Grenoble Alpes University,
Grenoble, France (2017)
34. Y. Gang, W. Zhao, J.-O. Klein, C. Chappert, P. Mazoyer, A high-reliability, low-power magnetic
full adder. IEEE Trans. Magn. 47(11), 4611–4616 (2011)
35. E. Deng, Y. Zhang, W. Kang, B. Dieny, J.-O. Klein, G. Prenat, W. Zhao, Synchronous 8-bit
non-volatile full-adder based on spin transfer torque magnetic tunnel junction. IEEE Trans.
Circuits Syst. I Regul. Pap. 62(7), 1757–1765 (2015)
36. A. Mochizuki, H. Kimura, M. Ibuki, T. Hanyu, TMR-based logic-in-memory circuit for low-
power VLSI. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E88-A(6), 1408–1415
(2005)
37. W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, P. Mazoyer, New non-volatile logic based on
spin-MTJ. Nanotechnology 205(6), 1373–1377 (2008)
38. S. Onkaraiah, M. Reyboz, F. Clermidy, J.-M. Portal, M. Bocquet, C. Muller, Hraziia, C. Anghel,
A. Amara, Bipolar ReRAM based non-volatile flip-flops for low-power architectures, in IEEE
International New Circuits and Systems Conference (2012), pp. 417–420
39. D. Chabi, W. Zhao, E. Deng, Y. Zhang, N.B. Romdhane, J.-O. Klein, C. Chappert, Ultra low
power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms.
IEEE Trans. Circuits Syst. I Regul. Pap. 61(6), 1755–1765 (2014)
40. W. Zhao, M. Moreau, E. Deng, Y. Zhang, J.-M. Portal, J.-O. Klein, M. Bocquet, H. Aziza,
D. Deleruyelle, C. Muller, D. Querlioz, N.B. Romdhane, D. Ravelosona, C. Chappert, Syn-
chronous non-volatile logic gate design based on resistive switching memories. IEEE Trans.
Circuits Syst. I Regul. Pap. 61(2), 443–454 (2014)
41. E. Deng, Y. Zhang, J.-O. Klein, D. Ravelosona, C. Chappert, W. Zhao, Low power magnetic
full-adder based on spin transfer torque MRAM. IEEE Trans. Magn. 49(9), 4982–4987 (2013)
42. S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, H. Hasegawa, T. Endoh, H. Ohno, T. Hanyu,
Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic
tunnel junctions. Appl. Phys. Express 1(9), 091301 (2008)
43. T. Hanyu, T. Endoh, D. Suzuki, H. Koike, Y. Ma, N. Onizawa, M. Natsui, S. Ikeda, H. Ohno,
Standby-power-free integrated circuits using MTJ-based VLSI computing. Proc. IEEE 104(10),
1844–1863 (2016)
44. W. Kang, H. Wang, Z. Wang, Y. Zhang, W. Zhao, In-memory processing paradigm for bitwise
logic operations in STT-MRAM. IEEE Trans. Magn. 53(11), 6202404 (2017)
45. S. Jain, A. Ranjan, K. Roy, A. Raghunathan, Computing in memory with spin-transfer torque
magnetic RAM. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(3), 470–483 (2018)
46. Z. He, S. Angizi, D. Fan, Exploring STT-MRAM based in-memory computing paradigm with
application of image edge extraction, in IEEE International Conference on Computer Design
(2017), pp. 439–446
47. Z. He, S. Angizi, F. Parveen, D. Fan, High performance and energy-efficient in-memory
computing architecture based on SOT-MRAM, in IEEE/ACM International Symposium on Nanoscale
Architectures (2017), pp. 97–102
48. D. Fan, Z. He, S. Angizi, Leveraging spintronic devices for ultra-low power in-memory
computing logic and neural network, in 2017 IEEE 60th International Midwest Symposium on
Circuits and Systems (2017), pp. 1109–1112
49. H. Zhang, W. Kang, L. Wang, K.L. Wang, W. Zhao, Stateful reconfigurable logic via a single
voltage-gated spin Hall effect driven magnetic tunnel junction in a spintronic memory. IEEE
Trans. Electron Devices 64(10), 4295–4301 (2017)
50. W. Kang, H. Zhang, P. Ouyang, Y. Zhang, W. Zhao, Programmable stateful in-memory
computing paradigm via a single resistive device, in IEEE International Conference on Computer
Design (2017), pp. 613–616
51. R.A. Buhrman, D.C. Ralph, C.-F. Pai, L. Liu, Electrically gated three-terminal circuits and
devices based on spin Hall torque effects in magnetic nanostructures apparatus, methods and
applications, U.S. Patent, no. US9230626B2, March 2016
52. H. Yoda, N. Shimomura, Y. Ohsawa, S. Shirotori, Y. Kato, T. Inokuchi, Y. Kamiguchi,
B. Altansargai, Y. Saito, K. Koi, H. Sugiyama, S. Oikawa, M. Shimizu, M. Ishikawa, K.
Ikegami, A. Kurobe, Voltage-control spintronics memory (VoCSM) having potentials of ultra-
low energy-consumption and high-density, in IEEE International Electron Devices Meeting
(2016), pp. 27.6.1–27.6.4
53. J.E. Hirsch, Spin Hall effect. Phys. Rev. Lett. 83(9), 1834–1837 (1999)
54. L. Liu, C.F. Pai, Y. Li, H.W. Tseng, D.C. Ralph, R.A. Buhrman, Spin-torque switching with
the giant spin Hall effect of tantalum. Science 336(6081), 555–558 (2012)
55. W.G. Wang, M. Li, S. Hageman, C.L. Chien, Electric-field-assisted switching in magnetic
tunnel junctions. Nat. Mater. 11(1), 64–68 (2012)
56. W. Kang, Y. Ran, Y. Zhang, W. Lv, W. Zhao, Modeling and exploration of the voltage
controlled magnetic anisotropy effect for the next-generation low-power and high-speed MRAM
applications. IEEE Trans. Nanotechnol. 16(3), 387–395 (2017)
57. H. Zhang, W. Kang, K. Cao, B. Wu, Y. Zhang, W. Zhao, Spintronic processing unit in spin
transfer torque magnetic random access memory. IEEE Trans. Electron Devices 66(4), 2017–2022
(2019)
