
Article · February 2015 · Source: arXiv


DRAFT

A High-Performance Solid-State Disk with


Double-Data-Rate NAND Flash Memory
Eui-Young Chung, Member, IEEE, Chang-Il Son, Kwanhu Bang, Student Member, IEEE,
Dong Kim, Soong-Mann Shin, and Sungroh Yoon, Senior Member, IEEE

Abstract—We propose a novel solid-state disk (SSD) architecture that utilizes a double-data-rate synchronous NAND flash interface
for improving read and write performance. Unlike the conventional design, the data transfer rate in the proposed design is doubled in
harmony with synchronous signaling. The new architecture does not require any extra pins with respect to the conventional architecture,
arXiv:1502.02239v1 [cs.AR] 8 Feb 2015

thereby guaranteeing backward compatibility. For performance evaluation, we simulated various SSD designs that adopt the proposed
architecture and measured their performance in terms of read/write bandwidths and energy consumption. Both NAND flash cell types,
namely single-level cells (SLCs) and multi-level cells (MLCs), were considered. In the experiments using SLC-type NAND flash chips,
the read and write speeds of the proposed architecture were 1.65–2.76 times and 1.09–2.45 times faster than those of the conventional
architecture, respectively. Similar improvements were observed for the MLC-based architectures tested. It was particularly effective to
combine the proposed architecture with the way-interleaving technique that multiplexes the data channel between the controller and
each flash chip. For a reasonably high degree of way interleaving, the read/write performance and the energy consumption of our
approach were notably better than those of the conventional design.

Index Terms—Solid-state disk (SSD), Double-data rate (DDR), NAND flash memory, Interleaving

1 INTRODUCTION

NAND-flash-based solid-state disks (SSDs) are replacing hard disk drives (HDDs), the mass storage device of choice for many decades, not only in high-end servers but also in mainstream PCs and in low-end mobile internet devices (MIDs). The compelling reason for this change can be attributed to the absence of mechanical moving parts in SSDs; this fact can substantially enhance key characteristics of mass storage devices such as read/write performance, power consumption, weight, form factor, reliability, shock resistance, and many others. In particular, the improved read/write performance of SSDs is expected to narrow the so-called CPU-IO performance gap [1], which has been a long-standing problem for accelerating computer systems. Due to the recent advent of multi-core CPUs, the CPU-IO performance gap would become even wider without a breakthrough in IO systems. Thus, read/write performance has become one of the most important metrics determining the overall merit of a storage device.

The two major components of a typical NAND-based SSD are the following: i) a number of NAND flash memory chips and ii) a control circuitry called the SSD controller, which manages the data transfer between the NAND flash chips and the host machine the SSD is attached to. The system-level read/write speed of an SSD is often orders of magnitude faster than that of HDDs, but this is not because the individual NAND flash chips inside the SSD are that fast. In fact, a major performance bottleneck in an SSD may occur due to the latency of accessing NAND flash memory. For instance, the time to program a flash cell is normally in the range of hundreds of microseconds, which is several orders of magnitude greater than the typical clock-cycle time of the SSD controller. Thus, the SSD controller must frequently slow down or sit idle in order to keep pace with NAND flash memory, thereby incurring a performance loss. SSDs can be faster than HDDs because of the various techniques employed to hide and/or reduce the latency of sluggish NAND flash memory, as will be surveyed shortly.

The NAND flash access time issue has become more critical due to the advent of multi-level-cell (MLC) flash memory. A traditional NAND flash chip can store only one bit per cell and is called single-level-cell (SLC) flash memory. In contrast, MLC flash memory can store multiple bits per cell. Thus, MLC flash memory is more cost-effective, since it demands much less die space than SLC flash memory to integrate the same capacity using the same process technology. Unfortunately, the MLC implementation inevitably increases the access time. For instance, it is known that the cell program time of MLC flash memory is approximately three times larger than that of SLC flash memory. Nevertheless, the adoption of MLC flash memory will rapidly grow, since the MLC implementation can significantly lower the per-bit cost, which is still much higher than that of HDDs.

The core competency of SSDs over HDDs can thus be obtained by trading off the access time and the cost of NAND flash memory in an effective manner. This point was recognized early, and many techniques have been proposed to alleviate the access time issue. As detailed in Section 2.3, examples include way interleaving, channel striping, and caching. Strictly speaking, these techniques are more for hiding the NAND flash access latency than for reducing it. There exist other approaches targeting the actual reduction of the latency. A key idea of these techniques is to replace the conventional asynchronous NAND flash interface scheme with a synchronous one, an idea that stems from the history of DRAM: the initial asynchronous DRAM interface was later replaced by faster synchronous interfaces. However, the limitation of these approaches is that they require additional pins, thereby causing area overhead and incompatibility with traditional components.

Our approach proposed in this paper belongs to the category of techniques that reduce the latency itself. More precisely, the contributions of our work are two-fold. First, we propose a novel SSD architecture that utilizes a double-data-rate (DDR) synchronous NAND flash interface for improving read and write performance. Unlike the conventional design, the data transfer rate in the proposed design is doubled in harmony with synchronous signaling. Furthermore, the new architecture does not require any extra pins with respect to the conventional architecture, thereby guaranteeing backward compatibility. Second, we thoroughly validate the performance of our approach by simulating various SSD designs that adopt the proposed architecture and by measuring their read and write bandwidths as well as energy consumption. Moreover, we show how the proposed architecture is combined with the two most popular latency-hiding techniques, namely way interleaving and channel striping, for their synergistic effects on overall performance at the SSD level. For realistic results, we consider both SLC and MLC NAND flash memory.

The rest of this paper is organized as follows. Section 2 introduces the basics of SSD architectures and discusses possible options for enhancing SSD performance. This section also provides a brief review of previous approaches for resolving the latency issue in NAND flash memory. In Section 3, we describe the conventional SSD architecture that uses the single-data-rate asynchronous NAND flash interface. The proposed SSD architecture that utilizes the new DDR synchronous NAND flash interface is detailed in Section 4. Finally, we provide our experimental results in Section 5, followed by a conclusion in Section 6.

• Eui-Young Chung and Kwanhu Bang are with the School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea. E-mail: {eychung, lamar49}@yonsei.ac.kr
• Chang-Il Son, Dong Kim and Soong-Mann Shin are with the Flash Solution R&D Center, Samsung Electronics, Hwasung, Kyungki, Korea. E-mail: {cison, dong.kim, sm1978.shin}@samsung.com
• Sungroh Yoon (corresponding author) is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea. E-mail: [email protected]
Manuscript drafted June 4, 2009.

2 PRELIMINARIES AND RELATED WORK

2.1 Typical SSD Architecture

Fig. 1 shows the architecture of a typical SSD, which is composed of multiple NAND flash memory chips and a controller to manage the data transfer between the host machine and the NAND flash chips. The controller contains various components such as a processor, random access memory (RAM), read-only memory (ROM), a host interface, and a NAND interface. The processor governs the controller by executing the firmware residing in the ROM chip. Some notable tasks of the processor include wear leveling and address translation, as will be explained in Section 2.2.1. The NAND interface labeled NAND_IF in Fig. 1 is to communicate with the NAND flash chips.

[Figure: the controller (processor, RAM, ROM, HOST_IF, NAND_IF) sits between the host and a NAND flash chip consisting of a cell array, page register, XY decoder, Y gating, control logic, and IO buffers and latches; the timing parameters tPROG, tR, tBYTE, and tREA mark the internal transfer paths.]
Fig. 1: Block diagram of a typical SSD.

Each NAND flash memory chip in the SSD architecture is composed of a cell array, a page register, an XY decoder, control logic, IO buffers, and latches. The cell array stores the entire set of data, while the page register temporarily stores one page of the data being requested for read or write. The XY decoder decodes the address issued by the controller, and the control logic manages the interface with the controller. The data transfer time from the cell array to the page register is defined as tR, and the time for the reverse action (i.e. the time to transfer data from the page register to the cell array) is called the page program time, or tPROG. Typically, tPROG is much larger than tR. The data transfer time between the page register and the IO buffer is referred to as tBYTE. Finally, tREA is the data transfer time between the IO buffer and the IO pads. More details on these timing parameters will be presented in Table 1 in Section 3.

2.2 Options for Improving SSD Performance

For the SSD architecture shown in Fig. 1, the opportunities for performance improvement can be summarized as follows: i) to enhance the performance of NAND flash cells, ii) to optimize the performance of the SSD controller, iii) to use a faster interface between the SSD and its host, iv) to accelerate the interface between the SSD controller and the NAND flash chips, and v) a mixture of these options. We survey the techniques based on options ii)–iv). Option i) is beyond the scope of this paper and will not be discussed further; interested readers are directed to [2], [3], [4], [5], [6], [7], [8].

2.2.1 Optimizing SSD Controller

This may be the option that has been most actively studied.
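As a concrete illustration of how channel striping and way interleaving (introduced in Section 1 and detailed below) hide the long program latency, the following Python sketch schedules page writes round-robin across channels and ways. This is our own toy model, not the authors' simulator, and the transfer and program times are illustrative placeholders rather than measured values.

```python
def write_time(num_pages, channels, ways, t_xfer=50.0, t_prog=200.0):
    """Estimate the total time (arbitrary units) to program `num_pages` pages.

    Each channel streams pages to its ways in round-robin order; a way that
    has just received a page stays busy for t_prog, during which the channel
    can feed its other ways.  All timing values are illustrative only.
    """
    chan_free = [0.0] * channels                         # when each channel is free
    way_free = [[0.0] * ways for _ in range(channels)]   # when each chip is idle again
    finish = 0.0
    for p in range(num_pages):
        c = p % channels                     # channel striping across channels
        w = (p // channels) % ways           # way interleaving within the channel
        start = max(chan_free[c], way_free[c][w])
        chan_free[c] = start + t_xfer                   # channel busy during transfer
        way_free[c][w] = start + t_xfer + t_prog        # chip busy while programming
        finish = max(finish, way_free[c][w])
    return finish

baseline = write_time(64, channels=1, ways=1)
interleaved = write_time(64, channels=1, ways=4)
striped = write_time(64, channels=4, ways=4)
assert interleaved < baseline and striped < interleaved
```

With these placeholder numbers, four ways per channel cut the single-chip write time roughly fourfold, because the channel keeps streaming pages while earlier chips are still programming; adding channels parallelizes the transfers themselves, at the pin and ECC cost discussed in Section 2.2.1.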

One of the most frequently used techniques is to increase data throughput by parallelizing the data paths between the controller and the NAND flash chips. Such paths are called channels, and there are largely two methods for the parallelization. One is called channel striping, which means using multiple channels in the NAND flash interface. The other is called way interleaving, which multiplexes each channel to send data in a round-robin fashion. By exploiting these techniques, it is possible to hide much of the latency of NAND flash memory.

[Figure: the SSD controller (processor, ROM, RAM with regions for the firmware and for CH1–CH4, HOST_IF, DRAM_IF) drives four channels, each through its own NAND_IF & ECC logic block; each channel connects to four NAND flash chips, Way1 through Way4.]
Fig. 2: An SSD architecture with 4 channels and 4 ways per channel.

Fig. 2 is an example of an SSD architecture adopting the techniques of channel striping and way interleaving simultaneously. The numbers of channels and ways in this example are both four. Of note is that channel striping is often more costly than way interleaving, since each channel requires a NAND interface block and an error correction code (ECC) block. The ECC block is essential for data reliability, especially when MLC flash is used. Another area penalty of the multi-channel design comes from increased pin counts: each channel requires dedicated pins to communicate with its dedicated NAND flash memory chips. For this reason, the number of channels should be selected carefully in order to achieve the required system performance within the area budget.

Another performance improvement technique from the controller perspective is to optimize the software called the flash translation layer (FTL) [9], [10], [11]. The FTL runs on the processor of an SSD controller, performs the mapping between logical and physical addresses, and also handles important housekeeping tasks such as wear leveling [12] and garbage collection. Wear leveling is to use all the flash cells in a chip as uniformly as possible; it plays a critical role in maintaining the initial performance and capacity of an SSD over time, since the lifetime of a flash cell is directly limited by how frequently it is written.

Besides, in most commercially available SSDs, DRAM is used as a cache buffer to hide the long access latency of NAND flash memory. If the data requested by the host machine happens to be found in the cache buffer, we can completely eliminate the data access time to NAND flash memory. Refer to Sections 2.3.1 and 2.3.2 for a brief survey of the existing approaches that belong to this category.

2.2.2 Improving Host Interface

This option is to increase the bandwidth between the SSD and its host machine. Currently, SSDs are attached to the host machine via legacy interfaces inherited from HDDs, such as parallel advanced technology attachment (PATA) and serial ATA (SATA) [13]. To achieve higher performance with fewer pins, SATA is rapidly replacing PATA these days, both for HDDs and for SSDs. In addition, to properly handle the increased bandwidth of SSDs, alternative high-speed interfaces such as peripheral component interconnect express (PCIe) have been tried for interfacing SSDs. Recently, it was proposed in [14] to attach SSDs to the North Bridge chipset using the DRAM interface, instead of using the South Bridge chipset in which the SATA and PATA controllers reside.

2.2.3 Accelerating NAND Flash Interface

This option is to increase the bandwidth between the controller and each NAND flash memory chip. Even though its objective is similar to that of channel striping or way interleaving, this option is more aggressive in the sense that the read and write bandwidths can be improved by reducing the latency directly, rather than by hiding it. A key technique in this category is to improve the NAND flash interface scheme in a synchronous fashion. Section 2.3.3 presents more details of existing techniques for accelerating NAND flash interfaces.

2.3 Related Work

2.3.1 Hiding the Latency of NAND Flash Memory

The effect of channel striping and way interleaving was extensively studied in [15], which used a 2-channel, 4-way-interleaving interface scheme with a software architecture adopting a hybrid-mapping algorithm. The proposed system outperformed the compared HDD by 77%. The improvement was mainly due to the increased parallelism and the interleaved accesses when programming NAND flash memory. However, the limitations of this approach include area overhead and complicated controller design due to the increased number of channels. Other approaches to latency hiding include the techniques proposed in [16], [17], where DRAM was used as the cache buffer for NAND flash memory. When a cache hit occurs, the data access time is solely determined by the DRAM access time, which is much smaller than the flash access time.

2.3.2 Optimizing the Firmware of SSD Controller

The techniques in this category aim at enhancing SSD performance by reducing the data transfer size, the operating time, and the number of extra operations

required for wear leveling. The techniques presented in [18], [19], [20] compress the data from the host in order to save storage space in NAND flash memory and to reduce the data transfer time from the controller to the flash chips. However, this method may incur extra time and area overheads for data (de)compression. The hybrid-mapping technique proposed in [9] aimed at improving the write speed by introducing two types of logical blocks called data blocks and log blocks. The number of log blocks is much smaller than that of data blocks, and data is always written to log blocks first. When all log blocks are used up, the FTL moves the data from the log blocks to the data blocks. This technique may incur extra computation overhead but can be beneficial for quick search owing to the small number of log blocks. The techniques introduced in [10], [11], [21], [22] can reduce the number of erase operations by using a page-map cache and smart mapping strategies; it was shown that the system performance can be enhanced by reducing the number of erase and garbage collection operations.

2.3.3 Improving Controller-Flash Interface

In [23], the authors introduced a synchronous NAND flash interface using a signal called the data valid strobe (DVS). This interface reduced the sensitivity to process, voltage, and temperature (PVT) variations and improved the read performance by isolating the timing of the controller from that of the NAND flash memory. However, this approach exploited only one edge of each clock signal, producing limited performance improvements. The focus of that work was more on desensitizing PVT variations than on boosting read and write performance.

Recently, some leading companies in the SSD business organized an initiative called the open NAND flash interface (ONFI) and proposed a DDR flash interface scheme, whose specification is available at [24]. Additionally, the authors in [25] proposed a similar concept along with a new SSD architecture. However, these approaches require additional pins, thus causing compatibility issues and area overhead. Furthermore, no quantitative analysis was performed to prove the effectiveness of these approaches and to show the impact of DDR interface schemes on SSD performance.

Our work presented in this paper belongs to the category of techniques that accelerate the interface between the SSD controller and the NAND flash chips. Unlike the aforementioned approaches, our DDR synchronous interface scheme provides pin-level compatibility with the traditional NAND flash memory interface. Moreover, we evaluate the effect of the proposed technique quantitatively with respect to various architectural choices (e.g. the number of channels and ways) from the SSD perspective.

3 CONVENTIONAL ASYNCHRONOUS NAND FLASH INTERFACE FOR SOLID-STATE DISKS

The overall structure of a typical SSD was explained in Section 2. In this section, we present additional details on the conventional method for interfacing the controller and the NAND flash chips in SSDs. The material in this section is crucial for understanding the new interface architecture proposed in Section 4. The major difference between the two architectures lies in the controller-flash interface; the conventional interface uses an asynchronous single-data-rate scheme, whereas the proposed design utilizes a synchronous double-data-rate scheme.

3.1 Block Diagram and Key Components

Fig. 3 shows the conventional asynchronous interface architecture. Note that only the NAND_IF block is drawn inside the controller block for clarity, although there exist additional blocks, as shown in Fig. 1. The NAND_IF block and the NAND flash chip communicate over three types of ports. The upper two ports are for transferring data strobe signals, and the lower one is for exchanging all the other control signals as well as data.

Inside the NAND_IF block, there are two blocks called generate write (Gen_W) and generate read (Gen_R). The signal that controls writes is called write enable bar (WEB) and is generated by the Gen_W block. The read control signal is named read enable bar (REB) and is produced by the Gen_R block. WEB and REB are sent to the NAND flash chip via the upper two ports of the interface. The D_CON block delays the clock (CLK) so that data transfers at the interface can fulfill any given timing specifications. The blocks called WFIFO and RFIFO are for buffering data from and to the host, respectively. The IO latches inside the flash chip include timing-critical parts called the write latch (WLAT) and the read latch (RLAT). WLAT temporarily stores the data moving from the controller to the page register, whereas RLAT temporarily stores the data moving from the page register to the controller.

3.2 Timing Parameters

To explain the write and read operations of the SSD interface architecture in Sections 3.3 and 3.4, we first show in Table 1 a number of important timing parameters for the interface building blocks. In the table, note that the first eight parameters are common to the conventional and the proposed interfaces. The next four apply only to the conventional architecture; the rest apply only to the proposed architecture detailed in Section 4. Additional timing parameters of the NAND flash chips themselves are available in [26], [27], [28].

3.3 Write Operation and Timing

Fig. 4(a) shows the write timing diagram of the conventional NAND flash memory interface. The controller asserts WEB and issues the first write command (CMD) to the flash chip in order to initiate a write operation.

[Figure: on the SSD board, the controller's NAND_IF (CLK-driven flip-flops with period tP, the Gen_W and Gen_R blocks producing WEB and REB, WFIFO/RFIFO for WDATA/RDATA, and D_CON producing DCLK) connects to the NAND flash chip (cell array with tPROG and tR, page register, control logic, and the IO latches WLAT and RLAT); tOUT, tIN, tD, tS/tH, tDS/tDH, tBYTE, and tREA annotate the signal paths.]
Fig. 3: Block diagram of the NAND flash memory interface in the conventional SSD architecture.

TABLE 1: Timing parameters for the conventional and proposed interface architectures.

Common to both architectures:
  tP: Clock (CLK) period
  tD: Delay amount of CLK by D_CON (i.e. the difference between CLK and DCLK); tD = α · tP, where 0 ≤ α ≤ 1/2
  tS/tH: Setup/hold time of WFIFO and RFIFO
  tR: Data fetch time (from Cell Array to Page Register)
  tPROG: Program time (from Page Register to Cell Array)
  tBYTE: Data transfer time between Page Register and WLAT/RLAT
  tWC: Write cycle time (i.e. one cycle of WEB)
  tRC: Read cycle time (i.e. one cycle of REB)

Conventional only (Fig. 3):
  tIN: Data propagation time between the IO pad of the controller and WFIFO/RFIFO
  tOUT: Signal propagation time from the FFs of the controller to the strobe pads of NAND flash memory
  tDS/tDH: Setup/hold time of IO signals with respect to WEB
  tREA: Data transfer time from RLAT to the IO pad of the controller

Proposed only (Fig. 5):
  tDIFF: Difference between the arrival time of DVS at RFIFO and the arrival time of IO in the NAND flash at RFIFO
  tDLL: Time delay by the DLL as defined in Eq. (2)
  tRWEBD: Propagation delay of RWEB from the strobe port of NAND flash memory to the DLL
  tIOS/tIOH: Setup/hold time of IO signals with respect to DVS
  tIOD: Data propagation delay from RLAT to the IO pad of NAND flash memory
  tRWC: One cycle of RWEB; replaces tRC and tWC
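The tD row is a constraint rather than a fixed value, and it can be checked mechanically. The sketch below is ours, with illustrative numbers; it simply derives the DCLK delay from the clock period per tD = α · tP.

```python
def dclk_delay(t_p: float, alpha: float) -> float:
    """tD = alpha * tP with 0 <= alpha <= 1/2 (Table 1); times in nanoseconds."""
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha must lie in [0, 1/2]")
    return alpha * t_p

# A hypothetical 20 ns CLK delayed by a quarter period gives a 5 ns tD.
assert dclk_delay(20.0, 0.25) == 5.0
```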

The destination addresses are then sent to the flash chip, followed by a series of data to be written to the page register through WLAT at every tWC, the period of WEB. Finally, the controller issues a program CMD to transfer the data in the page register to the cell arrays of the flash chip. During the program phase, the flash memory chip enters the busy state and cannot be interrupted until the end of the program phase. This time duration is defined as tPROG and is normally very long.

Note that, in the write mode, both control (i.e. WEB) and data are transferred concurrently from the controller to the flash chip; the delays of the control and data paths are almost identical. The conventional interface operates synchronously in the write mode in the sense that transfers are synchronized to the periodic WEB signal under the timing constraints set by tDS and tDH. The data transfer rate in the write mode can therefore be improved by increasing the frequency of WEB. However, the conventional interface is not considered synchronous, due to the asynchronous read mode, as will be explained next.
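The cost structure of this write sequence (a handful of command/address cycles, one WEB cycle per byte, then a long tPROG) can be put into numbers. The sketch below is ours; the 25 ns cycle, 2 KB page, 200 µs program time, and 5-cycle command overhead are illustrative assumptions, not values from this paper.

```python
def page_write_time_ns(page_bytes=2048, t_wc_ns=25, t_prog_ns=200_000, cmd_cycles=5):
    """One page write: command/address cycles, one byte per WEB cycle, then tPROG."""
    return cmd_cycles * t_wc_ns + page_bytes * t_wc_ns + t_prog_ns

total_ns = page_write_time_ns()
bandwidth_mb_s = 2048 / (total_ns * 1e-9) / 1e6
assert total_ns == 251_325   # tPROG dominates the 51,200 ns of data transfer
```

Under these assumptions a single chip sustains only about 8 MB/s of program bandwidth, which is why the latency-hiding techniques of Section 2.2.1 matter so much for write throughput.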

[Figure: (a) write mode: WEB toggles with period tWC while CMD, the address, and the write data are driven on IO[7:0] under the tDS/tDH constraints, followed by tPROG. (b) read mode: after two CMDs and the address, the chip is busy for tR; REB then toggles with period tRC, read data appear on IO[7:0] within tREA, and the controller samples them with DCLK (CLK delayed by tD) under tS/tH.]
Fig. 4: The timing diagrams of the conventional asynchronous NAND flash memory interface.
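The two diagrams differ in one crucial way, elaborated in Sections 3.3 and 3.4: in write mode, WEB and the data travel in the same direction, so the cycle time is set by the larger of the two path delays, whereas in read mode REB must first reach the chip before the data can travel back, so the cycle time is their sum. The sketch below uses hypothetical path delays of our own choosing, and it also previews the payoff of the DDR scheme of Section 4, which restores a max-style cycle (tRWC = tWC) and transfers on both edges.

```python
d_ctrl = 10.0   # hypothetical controller-to-flash strobe delay, ns
d_data = 12.0   # hypothetical flash-to-controller data delay, ns

t_wc = max(d_ctrl, d_data)    # write: control and data propagate together
t_rc = d_ctrl + d_data        # read: strobe goes out first, data comes back after
assert t_rc > t_wc            # hence tRC is normally longer than tWC

sdr_read_rate = 1 / t_rc      # conventional: one byte per REB cycle
ddr_read_rate = 2 / t_wc      # proposed: two bytes per RWEB cycle (tRWC = tWC)
assert ddr_read_rate / sdr_read_rate > 2   # cycle shortening and DDR compound
```

With these placeholder delays the interface-level gain is about 3.7x; the system-level improvements the paper reports (up to 2.76x for reads) are lower because tR, tPROG, and the rest of the datapath are unchanged.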

3.4 Read Operation and Timing

The timing diagram for the read operation is shown in Fig. 4(b). After issuing the first read CMD followed by the destination address, the second read CMD is issued to the flash chip. The chip then enters the busy state to fetch data from the cell arrays to the page register. This data fetching time is defined as tR, which is much shorter than tPROG. Thus, the data transfer time between the cell arrays and the page register is not as critical in the read mode as it was in the write mode. At the completion of the fetch, the flash chip enters the ready state, and the controller periodically asserts REB to the flash chip with the period tRC. For each REB cycle, the control logic inside the flash chip instructs a single data transfer from the page register to RLAT within tBYTE, and the data reach the IO ports of the controller within tREA. The controller then fetches the data into RFIFO at the positive edge of DCLK, a version of CLK delayed by tD. More precisely, tD is defined as

tD = α · tP, (1)

where 0 ≤ α ≤ 1/2. Note that DCLK is used to satisfy the setup time constraint imposed on RFIFO. Without DCLK, the system may easily violate the timing constraint due to the variations of tIN, tOUT, and tREA. Thus, each operation of propagating REB and fetching data is allowed to take at most tRC + tD, instead of tRC.

It is critical to notice the following: in the read mode of the conventional interface, the control (i.e. REB) and the data cannot be propagated concurrently, unlike in the write mode. That is, REB is first propagated from the controller to the flash chip, and then the data transfer occurs in the opposite direction. Consequently, a single read cycle should be determined by the sum of the propagation delays of REB and the data, unlike the write mode, in which a write cycle can be set by the maximum of the two delays. For this reason, tRC is normally longer than tWC, although the specification of commercial NAND flash memory usually lists identical timing parameters for convenience. The new interface architecture proposed in the next section focuses on reducing the read cycle time in order to enhance read performance.

4 PROPOSED DDR SYNCHRONOUS NAND FLASH INTERFACE FOR SOLID-STATE DISKS

In this section, we provide the details of the proposed NAND flash interface for improving SSD performance. This new architecture focuses on enhancing the data throughput between the NAND flash memory chips and the SSD controller. To this end, the proposed scheme operates in a synchronous manner for both the read and write modes and supports double-data-rate transfers. As highlighted in Section 3, a major performance bottleneck in the conventional NAND flash memory is the serialized, opposite-directional propagation of control and data in the read mode. The proposed interface breaks this serialized propagation path into two smaller ones, one for control and the other for data, and isolates them from the perspective of timing. More precisely, the REB control is generated by CLK and is propagated just as in the conventional architecture. On the other hand, the data is fetched from the flash chip to the controller

[Figure: in the proposed NAND_IF, the CLK-driven flip-flops and the Gen_W/Gen_R blocks produce RWEB (with delay tRWEBD); duplicated FIFOs (WFIFO0/WFIFO1 for WDATA, RFIFO0/RFIFO1 for RDATA) connect through multiplexers to duplicated latches (two WLATs and two RLATs) inside the flash chip; a DLL in the flash chip derives DVS from RWEB, with tDIFF, tIOS/tIOH, tIOD, tBYTE, tPROG, tR, and tS/tH marking the signal paths.]
Fig. 5: Block diagram of the proposed double-data-rate synchronous NAND flash memory interface.

in synchronization with a new control signal named data valid strobe (DVS), as depicted in Fig. 5. DVS is a data strobe asserted by the flash chip and can be considered as a data clock whose edges indicate stable points for data fetching.

Introducing DVS enables the synchronous read operation. To support DDR operation, we duplicate the RFIFO and WFIFO buffers inside the controller and the RLAT and WLAT latches inside the flash chip. In the controller, one pair of RFIFO and WFIFO is dedicated to the rising edge of CLK, and the other pair to the falling edge of CLK; in the flash chip, one pair of RLAT and WLAT is for the rising edge of DVS, and the other pair for the falling edge of DVS.

The notion of DVS was first introduced in [23], but the purpose of that work was not to increase the data bandwidth but to desensitize the interface to PVT variations, as discussed in Section 2.3. In contrast to [23], the proposed design can enhance the overall read/write performance of an SSD by allowing double-data-rate data transfers between the controller and the flash memory. We compare the performance of the interface introduced in [23] and that of the proposed architecture in Section 5.

The proposed scheme differs from the popular DDR DRAM interface in that the proposed architecture does not require an additional memory clock, since REB is replaced by the bidirectional DVS signal. Replacing REB by DVS, rather than adding an extra pin, is beneficial for maintaining backward compatibility with conventional components and boards.

Note that in the proposed architecture we rename WEB as RWEB, since it is used for both read and write modes.
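The edge-paired buffering described above can be illustrated with a small Python sketch (a toy model of my own, not the authors' design; the byte values are hypothetical). One latch bank serves rising edges and the other falling edges, so each clock cycle moves two bytes instead of one:

```python
# Toy model of double-data-rate transfer using two latch banks:
# bank 0 captures on rising edges, bank 1 on falling edges.

def ddr_transfer(data, cycles):
    """Return the bytes a controller would fetch in the given number of cycles."""
    received = []
    it = iter(data)
    for _ in range(cycles):
        for bank in (0, 1):               # bank 0: rising edge, bank 1: falling edge
            try:
                received.append(next(it))  # one byte fetched per edge
            except StopIteration:
                return received
    return received

page = list(range(8))                 # eight hypothetical bytes in the page register
assert ddr_transfer(page, 4) == page  # DDR: 8 bytes in 4 cycles
print("SDR would need", len(page), "cycles; DDR needs", len(page) // 2)
```

The halved cycle count is exactly the factor-of-two data-rate gain the duplicated FIFO/latch pairs are meant to provide.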
4.1 Proposed Interface Architecture

Fig. 5 shows the block diagram of the proposed DDR synchronous NAND flash memory interface. As stated earlier, REB has been replaced by DVS for synchronous operations, and the FIFOs and latches have been duplicated for DDR operations. The multiplexers are used inside the NAND flash chip in order to select WLAT for writes and RLAT for reads, depending on the edge type of RWEB. Now that RWEB is commonly used for both read and write modes, we do not need to distinguish tWC and tRC and thus use tRWC as the common timing parameter representing tWC and tRC. The D_CON and Gen_R blocks are not required in the proposed interface design but are included in the design shown in Fig. 5 for guaranteeing backward compatibility.

Note that the timing-critical path in the read mode is broken into two parts in the proposed design. One is the path for propagating RWEB, and the other is the data path from the NAND flash memory to the controller. The delay of the first path determines tRWC, since RWEB propagates through the same path in the write mode. Thus, tRWC is identical to tWC, rather than to tRC of the conventional NAND flash memory. The delay of the data path in the proposed architecture is shorter than tRC of the conventional architecture. This is because the propagation delay of RWEB does not need to be considered when calculating the data propagation delay. Consequently, the proposed interface can provide higher data throughput than the conventional one can.

To generate DVS at a stable data point, we use a delay-locked loop (DLL) circuit. The DLL is triggered by the data from RLAT and generates DVS by delaying RWEB to satisfy the setup time (tIOS) and hold time (tIOH) constraints at the input of the controller. We define the time delay introduced by the DLL as tDLL, which is given by

tDLL = tIOD,max − tRWEBD,min + tIOS,    (2)

where tRWEBD is the propagation delay of RWEB from the input port of the NAND flash memory to the DLL, and tIOD is the data propagation delay from RLAT to

the IO pads of the NAND flash memory. Note that the small variation in data availability can easily be adjusted by the DLL block.

4.2 Write/Read Operation and Timing

Fig. 6 shows the write and read timing diagrams of the proposed DDR synchronous NAND flash interface. In the proposed interface, data is transferred at both rising and falling edges of the RWEB signal in the write mode, as represented in Fig. 6(a). The data transfer rate can thus be improved by a factor of two compared with the conventional design. In the read mode shown in Fig. 6(b), the controller asserts RWEB, instead of REB, to the NAND flash memory at tR after the second CMD has been issued. At the same time, the first data is pre-fetched to RLAT from the page register. The data are then moved from RLAT to the IO ports and to the DLL block, which delays RWEB by tDLL for DVS generation. Finally, the controller fetches the data at the falling edge of DVS. For the next series of data, DVS is generated in a similar manner, and the controller fetches at both edges of DVS.

The major difference of the proposed design with respect to the conventional one is the concurrent propagation of control signals and data. Hence, it is possible for the proposed scheme to have a shorter read cycle than the conventional design.

4.3 Determining Operating Clock Period

To compare the proposed and the conventional architectures in terms of their maximum operating frequency, we calculate the minimum period of the system clock (i.e. tP,min) for each architecture.

4.3.1 Conventional Interface

By design, tP should be at least the larger of tRC and tWC, which are the periods of REB and WEB, respectively. Recall from Section 3.4 that tRC > tWC, since the propagation of REB and data should be serialized and happen within the same cycle in the read mode. Thus, we can ignore tWC for computing tP,min.

To determine tP,min, we also need to consider tBYTE, since the data transfer between the page register and RLAT occurs in a distinct clock cycle that precedes the REB and data propagation. If this tBYTE parameter is greater than tRC, tP,min should be determined by tBYTE. Consequently, tP,min is given by

tP,min = max{tRC, tBYTE}.    (3)

Since RFIFO is clocked by D_CON, which delays CLK by tD, the propagation of REB and data can take longer than tRC, as already explained in Section 3.4. In other words, the following equality should hold:

tRC + tD = tOUT + (tREA + tIN + tS),    (4)

where the tOUT term accounts for propagating REB and the (tREA + tIN + tS) term for propagating data.
Plugging Eq. (4) into Eq. (3) gives

tP,min = max{tOUT + (tREA + tIN + tS) − tD, tBYTE},    (5)

which further develops to

tP,min = max{(tOUT + tREA + tIN + tS) / (1 + α), tBYTE},    (6)

by applying Eq. (1) to Eq. (5). The maximum clock frequency of the conventional design can then be determined by Eq. (6).

4.3.2 Proposed Interface

For the proposed architecture, the value of tP should be at least the larger of tRWC and tBYTE, namely

tP,min = max{tRWC, tBYTE},    (7)

since tRWC plays the role of tRC.

Recall that the parameters tIOS and tIOH represent the setup and hold time constraints of data with respect to DVS at the IO pad of the controller, respectively. By design, tRWC is identical to the period of DVS, which should be at least twice the sum of tIOS and tIOH, as shown in Fig. 7(a). In other words, tP,min of the proposed architecture is given by

tP,min = max{(tIOS + tIOH) × 2, tBYTE},    (8)

where the term (tIOS + tIOH) is doubled since the proposed design supports DDR, and a single DVS cycle should thus be long enough to manage two transfers.

The architecture shown in Fig. 5 assumes that the controller and the NAND flash memory chips are integrated into a single board. Thus, tIOS and tIOH are affected by the geometric parameters of the board-level interconnects. When the board-level design parameters are available, we can derive an alternative representation of tP,min, given by

tP,min = max{(tS + tH + tDIFF) × 2, tBYTE},    (9)

where tS and tH are the setup and hold times of RFIFO, respectively, and tDIFF is the difference between the arrival time of DVS at RFIFO and the arrival time of IO from the NAND flash memory at RFIFO. As informally shown in Fig. 7(b), tDIFF is caused by the different interconnect delays of DVS and IO at the board level. In Eq. (9), note that tS and tH are independent of the geometric parameters of the board and that tDIFF also becomes a constant once the geometric parameters of the interconnects on the board have been decided.

The maximum clock frequency of the proposed design can be determined from either Eq. (8) or Eq. (9).

5 EXPERIMENTAL RESULTS

We present our results obtained from the experiments conducted to evaluate the performance of the new interface architecture proposed in Section 4. In particular, we measured the write and read bandwidths of various SSD architectures that utilize the proposed interface

Fig. 6: Timing diagrams of the proposed DDR synchronous NAND flash interface: (a) write mode; (b) read mode.

Fig. 7: Determining the minimum clock period of the proposed architecture: (a) tRWC should be at least (tIOS + tIOH) × 2; (b) the interconnect delays for DVS and IO are different.

design but are based upon different architectural and device-level choices such as the number of channels, the degree of way interleaving and the type of flash cell (i.e. SLC/MLC). In addition, we measured the energy consumption of the proposed architecture. By comparison with the conventional flash interface, we show how much impact the proposed scheme has on the SSD-level performance in a variety of scenarios.

After detailing the experimental setup in Section 5.1, we explain in Section 5.2 how the operating frequencies of the tested architectures were determined. Section 5.3 presents the results of our experiments conducted to evaluate the read/write performance and energy consumption at the SSD level.

5.1 Experimental Setting

Based upon the basic architecture shown in Fig. 1, two versions of SSD simulators were implemented: one for the conventional design and the other for the proposed design. The former employs the asynchronous interface shown in Fig. 3, whereas the latter utilizes the DDR synchronous interface depicted in Fig. 5. The controllers in both simulators were synthesized with a library built on a 130-nanometer process technology. The worst-case condition of this library consists of an IO voltage of 2.7 volts (V), an internal voltage of 1.35 V, and a temperature of 125 °C. The timing parameters of the controllers shown in Fig. 3 and Fig. 5 were extracted using Synopsys PrimeTime [29].

The NAND flash memory simulated in the experiments was modeled at the behavioral level with the timing parameters specified in [26] and [27] for SLC and MLC implementations, respectively, except for tBYTE. Choosing a reasonable value of tBYTE is crucial for realistic simulation results, since the maximum data transfer rate may be directly determined by tBYTE, as shown in Eqs. (6) and (9). If the value of tBYTE is too high, the first terms in these equations are eclipsed by tBYTE due to the max{·} operator. For our experiments, the value of tBYTE was chosen from [28], which contains the specifications of OneNAND, one of the fastest (i.e., smallest-tBYTE) NAND flash memories commercially available. Note that conventional NAND flash memory chips such as OneNAND are fabricated with only a single metal layer due to cost issues. If an additional metal layer is used, tBYTE would decrease further, and

the performance gap between the proposed and the conventional architectures would become wider.

For the workload used in the experiments, we used widely used sequential traces that consist of 64-KB read/write data chunks [30]. The sequential traces represent the typical access patterns observed when a large volume of data is written to or read from a storage device based on NAND flash memory. As the host interface, the SATA interface¹ was used. Finally, the overall SSD system was modeled at the behavioral level, and all the aforementioned models were integrated using Mentor Graphics Seamless [31].

5.2 Operating Frequency Determination

Using the simulators we developed, the major timing parameters of the proposed and the conventional interface architectures were measured, as listed in Table 2. The value of tDIFF was measured using CubicWare [32], [33], with the difference of the loading capacitances of DVS and IO at the board level set to 30 pF. The values of tS and tH are identical for both architectures since they were synthesized with the same library. Note that only the first five parameters in the table were obtained from measurements; the rest are from the specifications of the NAND flash chips [26], [27], [28].

TABLE 2: NAND flash memory timing parameter values used in the experiments.

Parameters   Conventional (ns)   Proposed (ns)
tOUT         7.82                N/A
tIN          1.65                N/A
tS           0.25                0.25
tH           0.02                0.02
tDIFF        N/A                 4.69
tREA         20                  N/A
tBYTE        12                  12

For the conventional SSD, the minimum data access period tP,min defined in Eq. (6) can be evaluated as tP,min = max{(7.82 + 20 + 1.65 + 0.25) / (1 + 0.5), 12} = 19.81 nanoseconds (ns) with the value α = 0.5. Based on this, the maximum data access rate of the conventional design was set to 50 MHz. For the proposed design, Eq. (9) is evaluated as tP,min = max{(0.25 + 0.02 + 4.69) × 2, 12} = 12 ns, and the maximum data access rate of the proposed design was set to 83 MHz.
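The arithmetic above can be checked with a few lines of Python; the parameter values are taken from Table 2, and the resulting ceilings round down to the 50 MHz and 83 MHz operating points used in the experiments:

```python
# Evaluate t_P,min for both interfaces with the Table 2 values (all in ns).
t_OUT, t_REA, t_IN, t_S = 7.82, 20.0, 1.65, 0.25
t_H, t_DIFF, t_BYTE = 0.02, 4.69, 12.0
alpha = 0.5

# Conventional interface, Eq. (6): serialized REB-plus-data path over 1 + alpha.
tp_conv = max((t_OUT + t_REA + t_IN + t_S) / (1 + alpha), t_BYTE)

# Proposed interface, Eq. (9): two transfers per DVS cycle plus board-level skew.
tp_prop = max((t_S + t_H + t_DIFF) * 2, t_BYTE)

print("conventional: %.2f ns (about %.1f MHz)" % (tp_conv, 1e3 / tp_conv))
print("proposed:     %.2f ns (about %.1f MHz)" % (tp_prop, 1e3 / tp_prop))
```

Note that the proposed interface is pinned at 12 ns by tBYTE, not by the board-level term (which is only 9.92 ns), which is why the text singles out tBYTE as the remaining limiter.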
5.3 SSD-Level Performance Analysis

We compared and contrasted the performance of the SSDs designed with the proposed synchronous DDR interface with that of the SSDs using the conventional interface. The comparison criteria used were i) the write and read speeds, which have become among the most important performance metrics for comparing different SSDs, and ii) energy consumption.

Throughout the two sets of experiments detailed in Sections 5.3.1 and 5.3.2, we wanted to see how the proposed architecture can guide design decisions about the internal channel architecture; this is critical since it can trade off the area and performance of the SSD under design.

Three different interface designs were implemented and compared: the conventional asynchronous interface outlined in Section 3, the synchronous (but not double-data-rate) interface proposed in [23], and the proposed synchronous double-data-rate interface explained in Section 4. In this section, these designs are referred to as CONV, SYNC_ONLY and PROPOSED, respectively.

For convenience in implementation, the SYNC_ONLY architecture was not developed from scratch but was derived from PROPOSED by replacing DDR transfers with single-data-rate transfers. The operating frequency of SYNC_ONLY was thus set to 83 MHz.

5.3.1 Architectures with Different Way Interleaving

We designed single-channel SSDs with five different degrees of way interleaving: 1-way, 2-way, 4-way, 8-way and 16-way. The write and read performance of each design was then measured for the three competing interfaces and the two flash cell types, as shown in Fig. 8 and Table 3. The experimental results we obtained clearly indicate that the proposed design greatly improves the system performance in conjunction with the way-interleaving technique, as detailed below.

• Case I (write, SLC): We first consider the SLC cases shown in Fig. 8(a). For the 1-way design, the write performance of CONV and PROPOSED is similar, the latter being better only by 9%. This marginal improvement originates from the fact that the data transfer time from the SSD controller to the NAND flash memory is much smaller than the cell program time tPROG of the NAND flash memory. What PROPOSED reduces is the data transfer time, rather than tPROG. By Amdahl's law, the impact of reducing the data transfer time on the overall performance is therefore diminished by the dominant size of tPROG.

However, as the degree of way interleaving is increased, the advantage of using PROPOSED becomes more evident. For CONV, the performance gain by way interleaving decreases as the number of ways increases, eventually saturating at the 8-way design. In contrast, for PROPOSED, the interleaving effect was maintained throughout all the degrees of way interleaving. Note that CONV achieved only about a 5x performance gain as the number of ways changed from 1 to 16, whereas the performance gain by PROPOSED was more than 11x under the same condition. For the 16-way design, PROPOSED outperformed CONV by 2.45 times. This difference is caused by the fact that PROPOSED enables the controller to put more data into the flash chips in a fixed amount of time (i.e. tPROG) than CONV.

1. We used SATA2, or "SATA 3 Gbit/s," which supports a bandwidth of up to 300 MB/s.

TABLE 3: Details of the values drawn in Fig. 8.

                        Performance (MB/s)          Ratio
Cell  Mode   Way     C†       S        P         P/S    P/C
SLC   Write  1       7.77     8.38     8.50      1.01   1.09
             2       15.22    16.59    17.52     1.06   1.15
             4       28.94    31.90    34.30     1.08   1.19
             8       39.78    55.36    63.00     1.14   1.58
             16      39.76    60.44    97.35     1.61   2.45
             Mean‡   26.29    34.53    44.13     1.16   1.42
      Read   1       27.78    36.66    47.89     1.31   1.72
             2       42.78    67.16    70.47     1.05   1.65
             4       42.75    67.13    117.68    1.75   2.75
             8       42.72    67.11    117.64    1.75   2.75
             16      42.69    67.11    117.59    1.75   2.75
             Mean    39.74    61.03    94.25     1.49   2.26
MLC   Write  1       4.43     4.55     4.65      1.02   1.05
             2       8.36     8.85     9.24      1.04   1.11
             4       15.24    16.75    18.13     1.08   1.19
             8       25.86    29.72    34.08     1.15   1.32
             16      32.45    45.99    57.23     1.24   1.76
             Mean    17.27    21.17    24.67     1.11   1.26
      Read   1       26.04    33.58    42.69     1.27   1.64
             2       41.59    60.41    77.19     1.28   1.86
             4       41.55    64.76    101.61    1.57   2.45
             8       41.52    64.75    110.56    1.71   2.66
             16      41.50    64.73    110.52    1.71   2.66
             Mean    38.44    57.65    88.51     1.49   2.21

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4–6; the geometric mean for columns 7–8.
Fig. 8: Write/read speed of single-channel SSDs designed with different degrees of way interleaving (see Table 3 for more details): (a) single-level cell; (b) multi-level cell.

The performance of SYNC_ONLY lay between those of CONV and PROPOSED, as expected from the fact that SYNC_ONLY does not support double-data-rate data transfers.

• Case II (read, SLC): This case is shown in the right-hand side of Fig. 8(a). The overall performance of reading was higher than that of writing for all three interfaces tested. By design, the way-interleaving technique can be fully effective during tR in the read mode, while it cannot fully utilize tPROG in the write mode. Even in this case, the way-interleaving technique is more effective for PROPOSED, since the performance of PROPOSED saturates at a larger degree of way interleaving than that of CONV. Namely, PROPOSED and CONV saturate when the degrees of way interleaving are 4-way and 2-way, respectively. The relative performance of PROPOSED over CONV in the read mode was also higher than that in the write mode for all degrees of way interleaving. For instance, PROPOSED outperformed CONV by a factor of 2.75 for the 16-way design.

• Case III (write/read, MLC): Fig. 8(b) shows the results for the MLC NAND flash memory design. The read time (tR) and program time (tPROG) parameters of MLC devices are much larger than those of SLC devices. Thus, the effect of way interleaving on the overall performance decreases in MLC devices for the same degree of way interleaving. This reduction in the effectiveness of way interleaving is larger in the write mode than in the read mode, since tPROG is much larger than tR. This result indicates that the proposed interface combined with the interleaving technique can be more effective for high-capacity storage devices that are composed of many MLC chips than for low-capacity storages. We can also deduce that the proposed design is more advantageous for storage devices with many low-density MLC chips than for storages with a small number of high-density MLC chips.
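The saturation behavior in Cases I and II can be reproduced with a toy pipeline model (my own simplification, not the authors' simulator, and the parameter values below are purely illustrative): each way occupies the shared channel for one page-transfer time, programs in parallel off the channel, and throughput stops growing once the ways keep the channel fully busy.

```python
# Toy single-channel, N-way interleaving model (illustrative numbers only).
# A page occupies the channel for t_xfer, then the chip works for t_cell
# (tPROG for writes, tR for reads) off the channel, overlapped with other ways.

def bandwidth_mb_s(ways, t_xfer_us, t_cell_us, page_kb=2.0):
    """Steady-state throughput: the channel or the cells, whichever binds."""
    # Per page, the limiting interval is t_xfer if the channel is the
    # bottleneck, or t_cell / ways if the cells are.
    t_page_us = max(t_xfer_us, t_cell_us / ways)
    return page_kb / 1024.0 / (t_page_us * 1e-6)

T_PROG = 200.0   # hypothetical SLC program time (us)
slow_bus = 40.0  # page transfer time on a slow bus (us)
fast_bus = 20.0  # same page on a 2x-faster, DDR-like bus (us)

for ways in (1, 2, 4, 8, 16):
    slow = bandwidth_mb_s(ways, slow_bus, T_PROG)
    fast = bandwidth_mb_s(ways, fast_bus, T_PROG)
    print("%2d-way: slow bus %6.1f MB/s, fast bus %6.1f MB/s" % (ways, slow, fast))
```

With these made-up numbers the slow bus saturates at 8 ways (roughly a 5x gain over 1-way) while the twice-as-fast bus keeps scaling to 16 ways, mirroring the CONV-versus-PROPOSED trend in Table 3.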
5.3.2 Architectures with Various Channel Configurations

In practice, the capacity of a storage system is typically determined earlier than micro-architectural design parameters such as the number of ways and channels. Given a capacity value, we can explore the various combinations of ways and channels to search for an optimal design. In this regard, we tested three different SSD architectures of varying channel/way configurations

(1-channel/16-way, 2-channel/8-way and 4-channel/4-way), while keeping the product of channels and ways constant. In other words, the number of NAND flash chips (i.e. the total capacity) used in each architecture was kept identical. Throughout this experiment, we wanted to determine the optimal number of channels and the degree of way interleaving, considering the trade-off between performance and area. For each design, the write and read speeds were measured both for SLC- and MLC-based implementations. The results are shown in Fig. 9 and Table 4.

TABLE 4: Details of the values drawn in Fig. 9.

                          Performance (MB/s)          Ratio
Cell  Mode   Ch-Way    C†       S        P         P/S    P/C
SLC   Write  1-16      39.76    60.44    97.35     1.61   2.45
             2-8       74.07    101.99   114.83    1.13   1.55
             4-4       103.76   115.68   123.52    1.07   1.19
             Mean‡     72.53    92.70    111.90    1.25   1.65
      Read   1-16      42.69    67.11    117.59    1.75   2.75
             2-8       81.44    126.70   224.82    1.77   2.76
             4-4       155.35   237.61   max§      –      –
             Mean      93.16    143.81   235.25    1.76   2.76
MLC   Write  1-16      32.45    45.99    57.23     1.24   1.76
             2-8       48.72    56.83    64.75     1.14   1.33
             4-4       57.46    63.55    68.49     1.08   1.19
             Mean      46.21    55.46    63.49     1.15   1.41
      Read   1-16      41.50    64.73    110.52    1.71   2.66
             2-8       79.32    122.48   201.42    1.64   2.54
             4-4       150.94   230.17   max       –      –
             Mean      90.59    139.13   217.18    1.68   2.60

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4–6; the geometric mean for columns 7–8.
§ Reached the maximum bandwidth of the SATA interface.

Fig. 9: Write/read speed of SSDs designed with different numbers of channels and degrees of way interleaving (see Table 4 for more details): (a) single-level cell; (b) multi-level cell.

• Case I (write, SLC): In the write mode shown in Fig. 9(a), the performance of PROPOSED increased more slowly than that of CONV as the area (i.e. the number of channels) increased. In our experiment, the architectures designed with more channels have fewer degrees of way interleaving, and thus the benefit of using PROPOSED decreases as more channels are used. In the write mode, it would therefore be better to increase the degree of way interleaving than to increase the number of channels if a tight area budget is given.

• Case II (read, SLC): Unlike the write mode, the performance of the three interfaces increases in an almost identical fashion as more channels and fewer degrees of way interleaving are used. This is because the interval to which the way-interleaving technique is applied is much shorter in the read mode (i.e. tR in the read mode versus tPROG in the write mode). Note that the read bandwidth is much higher than the write bandwidth. Thus, the read bandwidth of the (4-channel, 4-way) configuration in Fig. 9(a) actually reached the bandwidth of the SATA host interface we used.

• Case III (write/read, MLC): Fig. 9(b) shows the results from simulating the MLC-based SSD designs. The overall performance pattern is similar to that appearing in Fig. 9(a). However, the degree of performance improvement is smaller than that in the SLC case. For instance, in the SLC-based design, the read bandwidth of PROPOSED improved by 1.91 times as the configuration changed from (1-channel, 16-way) to (2-channel, 8-way). In contrast, in the MLC-based scheme, the read performance of PROPOSED increased only by 1.81 times for the same change in channel and way configuration. This phenomenon becomes more evident in the write mode. This is again related to the length of the period to which the way-interleaving technique can be applied. This period in the write mode is tPROG, which is much larger than the counterpart tR in the read mode. In the write mode, a larger degree of way interleaving is required in order to saturate the channel bandwidth. Thus, increasing channels is effective only when the degree of way interleaving is sufficiently large. Typically, the difference in tPROG between MLC and SLC is much larger than the difference in tR between MLC and SLC. Therefore, the performance degradation of MLC-based SSDs is more clearly seen in the write mode.
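Continuing the toy model idea (again with illustrative, not measured, parameters), the channel/way trade-off at fixed capacity can be sketched by scaling per-channel throughput with the number of channels and capping the aggregate at the host-interface bandwidth:

```python
# Explore (channels, ways) pairs with a fixed product of 16 chips.
# Per-channel throughput follows the same channel-vs-cell bound as before;
# the host interface (SATA2, ~300 MB/s) caps the aggregate (toy numbers).

SATA_MB_S = 300.0

def ssd_bandwidth(channels, ways, t_xfer_us, t_cell_us, page_kb=2.0):
    t_page_us = max(t_xfer_us, t_cell_us / ways)    # per-channel page interval
    per_channel = page_kb / 1024.0 / (t_page_us * 1e-6)
    return min(channels * per_channel, SATA_MB_S)   # host interface cap

T_READ = 25.0                                       # hypothetical page read time (us)
for channels, ways in [(1, 16), (2, 8), (4, 4)]:    # 16 chips in every config
    bw = ssd_bandwidth(channels, ways, t_xfer_us=20.0, t_cell_us=T_READ)
    print("%d-channel/%d-way read: %6.1f MB/s" % (channels, ways, bw))
```

With these made-up timings the read bandwidth roughly doubles per added channel until the 4-channel configuration hits the 300 MB/s SATA cap, which is the same qualitative pattern as the "max" entries for the 4-channel/4-way reads in Table 4.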

TABLE 5: Details of the values drawn in Fig. 10.

                      Energy (nJ/B)           Ratio
Cell  Mode   Way    C†      S       P       P/S    P/C
SLC   Write  1      2.90    5.01    5.47    1.09   1.89
             2      1.48    2.53    2.65    1.05   1.80
             4      0.78    1.32    1.36    1.03   1.74
             8      0.57    0.76    0.74    0.97   1.30
             16     0.57    0.69    0.48    0.69   0.84
             Mean‡  1.26    2.06    2.14    0.95   1.45
      Read   1      0.81    1.15    0.97    0.85   1.20
             2      0.53    0.63    0.66    1.06   1.25
             4      0.53    0.63    0.40    0.63   0.75
             8      0.53    0.63    0.40    0.63   0.75
             16     0.53    0.63    0.40    0.63   0.75
             Mean   0.58    0.73    0.56    0.74   0.91

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4–6; the geometric mean for columns 7–8.

Fig. 10: Energy consumed by different SSD controllers to transfer a single byte (see Table 5 for more details). Unit: nano-joules per byte.

5.3.3 Energy Consumption Comparison

To see the impact of the proposed architecture on energy consumption, we first measured the average power consumption of the SSD controllers that adopt the different interfaces when these controllers read or write the same amount of data. Note that the operating frequencies of CONV, SYNC_ONLY and PROPOSED are different. Thus, for a fair comparison, we further divided the power consumption of an interface by the bandwidth (measured in megabytes per second) at which this interface operates. In other words, we compared the energy consumed by the SSD controllers to transfer a single byte of data.

Fig. 10 and Table 5 show the results we obtained from simulating the SLC-based designs for write and read operations for various degrees of way interleaving. For low degrees of way interleaving, PROPOSED consumed more energy than CONV to read or write the same amount of data. However, as the degree of way interleaving increases, the energy consumed by PROPOSED gradually became the smallest among the alternatives. Due to the performance issues discussed in Section 5.3.1, it is likely that most SSDs will continue to be designed with a reasonably high degree of way interleaving. For such designs, adopting the proposed interface would be highly beneficial, since it outperforms the alternatives not only in terms of the read/write bandwidth but also with respect to energy efficiency.

6 CONCLUSION

We have proposed a novel SSD architecture that exploits a double-data-rate synchronous NAND flash interface. This new design not only enhances the write and read performance but also retains backward compatibility with existing single-data-rate asynchronous NAND flash memory. The performance of SSDs that exploit the way-interleaving technique can be greatly improved by adopting the proposed approach. Our experimental results show that the proposed architecture outperforms the conventional one by 1.65–2.76 times in the read mode and 1.09–2.45 times in the write mode for the SLC-based architectures we considered. For the MLC-based architectures tested, the new design we propose improves the performance by 1.64–2.66 times in the read mode and 1.05–1.76 times in the write mode over the conventional design. The proposed scheme can dramatically increase the operating frequency of the interface, limited only by tBYTE, the device-level parameter that characterizes the read time of a flash cell. As process technology advances, tBYTE will keep decreasing, and the impact of our scheme will become more prominent.

ACKNOWLEDGMENT

This work was supported by Samsung Electronics Co., Ltd., by the IC Design Education Center (IDEC), by a KOSEF grant funded by the Korean Government (MEST) (No. 2009-0079888), and by a KRF grant funded by the Korean Government (MOEHRD) (No. KRF-2007-313-D00578).

REFERENCES

[1] R. Katz, G. Gibson, and D. Patterson, "Disk system architectures for high performance computing," Proceedings of the IEEE, vol. 77, no. 12, pp. 1842–1858, 1989.
[2] Y. Choi, K. Suh, Y. Koh, J. Park, K. Lee, Y. Cho, and B. Suh, "A high speed programming scheme for multi-level NAND flash memory," in Proceedings of the Symposium on VLSI Circuits, 1996, pp. 170–171.
[3] J. Kim, K. Sakui, S. Lee, Y. Itoh, S. Kwon, K. Kanazawa, K. Lee, H. Nakamura, K. Kim, T. Himeno et al., "A 120-mm² 64-Mb NAND flash memory achieving 180 ns/Byte effective program speed," IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 670–680, 1997.
[4] K. Takeuchi and T. Tanaka, "A dual-page programming scheme for high-speed multigigabit-scale NAND flash memories," IEEE Journal of Solid-State Circuits, vol. 36, no. 5, pp. 744–751, 2001.
[5] J. Lee, H. Im, D. Byeon, K. Lee, D. Chae, K. Lee, S. Hwang, S. Lee, Y. Lim, J. Lee et al., "High-performance 1-Gb NAND flash memory with 0.12-μm technology," IEEE Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1502–1509, 2002.
[6] K. Imamiya, H. Nakamura, T. Himeno, T. Yarnamura, T. Ikehashi, K. Takeuchi, K. Kanda, K. Hosono, T. Futatsuyama, K. Kawai et al., "A 125-mm² 1-Gb NAND flash memory with 10-MByte/s program speed," IEEE Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1493–1501, 2002.

[7] T. Hara, K. Fukuda, K. Kanazawa, N. Shibata, K. Hosono, H. Maejima, M. Nakagawa, T. Abe, M. Kojima, M. Fujiu et al., "A 146-mm² 8-Gb multi-level NAND flash memory with 70-nm CMOS technology," IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 161–169, 2006.
[8] K. Takeuchi, Y. Kameda, S. Fujimura, H. Otake, K. Hosono, H. Shiga, Y. Watanabe, and T. Futatsuyama, "A 56-nm CMOS 99-mm² 8-Gb multi-level NAND flash memory with 10-MB/s program throughput," IEEE Journal of Solid-State Circuits, vol. 42, no. 1, p. 219, 2007.
[9] J. Kim, J. Kim, S. Noh, S. Min, and Y. Cho, "A space-efficient flash translation layer for CompactFlash systems," IEEE Transactions on Consumer Electronics, vol. 48, no. 2, pp. 366–375, 2002.
[10] S. Kim and S. Jung, "A log-based flash translation layer for large NAND flash memory," in Proceedings of the 8th International Conference (ICACT 2006), 2006, pp. 1641–1644.
[11] C. Wu and T. Kuo, "An adaptive two-level management for the flash translation layer in embedded systems," in Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design, 2006, pp. 601–606.
[12] M. Assar, S. Nemazie, P. Estakhri et al., "Flash memory mass storage architecture incorporation wear leveling technique," United States Patent 5,479,638, Dec. 26, 1995.
[13] http://www.serialata.org.
[14] D. Kim, K. Bang, S. Ha, C. Park, S. Chung, and E. Chung, "Solid-state disk with double data rate DRAM interface for high-performance PCs," IEICE Transactions on Information and Systems, vol. E92-D, no. 4, pp. 727–731, 2009.
[15] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi, "A high performance controller for NAND flash-based solid state disk (NSSD)," in Proceedings of the 21st Non-Volatile Semiconductor Memory Workshop (IEEE NVSMW), 2006, pp. 17–20.
[16] D. Ryu, "Solid state disk controller apparatus," United States Patent App. 11/311,990, Dec. 19, 2005.
[17] J. Lee and D. Ryu, "Semiconductor solid state disk controller," United States Patent App. 11/594,893, Nov. 9, 2006.
[18] K. Yim, H. Bahn, and K. Koh, "A flash compression layer for SmartMedia card systems," IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192–197, 2004.
[19] W. Huang, C. Chen, Y. Chen, and C. Chen, "A compression layer for NAND type flash memory systems," in Proceedings of the Third International Conference on Information Technology and Applications (ICITA 2005), vol. 1, 2005.
[20] W. Huang, C. Chen, and C. Chen, "The real-time compression layer for flash memory in mobile multimedia devices," in Proceedings of the International Conference on Multimedia and Ubiquitous Engineering (MUE'07), 2007, pp. 171–176.
[21] L. Chang and T. Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," in Proceedings of the Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, 2002, pp. 187–196.
[22] S. Lim and K. Park, "An efficient NAND flash file system for flash memory storage," IEEE Transactions on Computers, pp. 906–912, 2006.
[23] C. Son, S. Yoon, S. Chung, C. Park, and E. Chung, "Variability-insensitive scheme for NAND flash memory interfaces," Electronics Letters, vol. 42, no. 23, pp. 1335–1336, 2006.
[24] http://www.onfi.org.
[25] R. Schuetz, H. Oh, J. Kim, H. Pyeon, S. Przybylski, and P. Gillingham, "HyperLink NAND flash architecture for mass storage applications," in Proceedings of the IEEE Non-Volatile Semiconductor Memory Workshop, 2007, pp. 3–4.
[26] K9F1G08U0B 128M x 8-bit NAND Flash Memory Data Sheet V1.0, Samsung Electronics Company, 2006.
[27] K9GAG08U0M 2G x 8-bit NAND Flash Memory Data Sheet V1.0, Samsung Electronics Company, 2006.
[28] FK8G16Q2M 2Gb MuxOneNAND M-die Data Sheet V1.1, Samsung Electronics Company, 2007.
[29] http://www.synopsys.com/Tools/Implementation/SignOff/Pages/PrimeTime.aspx.
[30] MultiMediaCard System Specification Version 4.2, MMCA MultiMediaCard Association, 2006.
[31] http://www.mentor.com/products/fv/seamless/.
[32] M. Jang, H. Jin, B. Lee, J. Lee, S. Song, T. Kim, and J. Kong, "CubicWare: a hierarchical design system for deep submicron ASIC," in Proceedings of the Twelfth Annual IEEE International ASIC/SOC Conference, 1999, pp. 168–172.
[33] http://www.samsung.com/global/business/semiconductor/products/asic/Products EDASupport.html.
