A High-Performance Solid-State Disk With Double-Data-Rate NAND Flash Memory

arXiv:1502.02239v1 [cs.AR] 8 Feb 2015

Abstract—We propose a novel solid-state disk (SSD) architecture that utilizes a double-data-rate synchronous NAND flash interface for improving read and write performance. Unlike the conventional design, the data transfer rate in the proposed design is doubled in harmony with synchronous signaling. The new architecture does not require any extra pins with respect to the conventional architecture, thereby guaranteeing backward compatibility. For performance evaluation, we simulated various SSD designs that adopt the proposed architecture and measured their performance in terms of read/write bandwidths and energy consumption. Both NAND flash cell types, namely single-level cells (SLCs) and multi-level cells (MLCs), were considered. In the experiments using SLC-type NAND flash chips, the read and write speeds of the proposed architecture were 1.65–2.76 times and 1.09–2.45 times faster than those of the conventional architecture, respectively. Similar improvements were observed for the MLC-based architectures tested. It was particularly effective to combine the proposed architecture with the way-interleaving technique that multiplexes the data channel between the controller and each flash chip. For a reasonably high degree of way interleaving, the read/write performance and the energy consumption of our approach were notably better than those of the conventional design.
Index Terms—Solid-state disk (SSD), Double-data rate (DDR), NAND flash memory, Interleaving
bit cost, which is still much higher than that of HDDs. The core competency of SSDs over HDDs can thus be obtained by trading off the access time and the bit cost. One of the most frequently used techniques is to increase data throughput by parallelizing the data paths between the controller and the NAND flash chips. Such paths are called channels, and there are largely two methods for the parallelization. One is called channel striping, which means using multiple channels in the NAND flash interface. The other is called way interleaving, and this is to multiplex each channel to send data in a round-robin fashion. By exploiting these techniques, it is possible to hide much of the latency of NAND flash memory.

Fig. 2: An SSD architecture with 4 channels and 4 ways per channel.

Fig. 2 is an example of an SSD architecture adopting the techniques of channel striping and way interleaving simultaneously. The numbers of channels and ways in this example are both four. Of note is that channel striping is often more costly than way interleaving, since each channel requires a NAND interface block and an error correction code (ECC) block. The ECC block is essential for data reliability, especially when MLC flash is used. Another area penalty of multi-channel design comes from increased pin counts: each channel requires dedicated pins to communicate with its dedicated NAND flash memory chips. For this reason, the number of channels should be selected carefully in order to achieve the required system performance within the area budget.

Another performance-improvement technique from the controller perspective is to optimize the software called the flash translation layer (FTL) [9], [10], [11]. The FTL runs on the processor of an SSD controller, performs mapping between logical and physical addresses, and also handles important housekeeping tasks such as wear leveling [12] and garbage collection. Wear leveling is to use all the flash cells in a chip as uniformly as possible; it plays a critical role in maintaining the initial performance and capacity of an SSD over time, since the lifetime of a flash cell is directly limited by its write frequency.

Besides, in most commercially available SSDs, DRAM is used as a cache buffer to hide the long access latency of NAND flash memory. If the data requested by the host machine happens to be found in the cache buffer, we can completely eliminate the data access time to NAND flash memory. Refer to Sections 2.3.1 and 2.3.2 for a brief survey of the existing approaches that belong to this category.

2.2.2 Improving Host Interface
This option is to increase the bandwidth between the SSD and its host machine. Currently, SSDs are attached to the host machine via legacy interfaces inherited from HDDs, such as parallel advanced technology attachment (PATA) and serial ATA (SATA) [13]. To achieve higher performance with lower pin counts, SATA is rapidly replacing PATA these days, both for HDDs and SSDs. In addition, to handle the increased bandwidth of SSDs properly, alternative high-speed interfaces such as peripheral component interconnect express (PCIe) have been tried for interfacing SSDs. Recently, it was proposed in [14] to attach SSDs to the North Bridge chipset using the DRAM interface, instead of using the South Bridge chipset in which the SATA and PATA controllers reside.

2.2.3 Accelerating NAND Flash Interface
This is to increase the bandwidth between the controller and each NAND flash memory chip. Even though the objective of this option is similar to that of channel striping or way interleaving, this option is more aggressive in the sense that the read and write bandwidths can be improved by reducing the latency directly, rather than hiding it. A key technique in this category is to improve the NAND flash interface scheme in a synchronous fashion. Section 2.3.3 presents more details of existing techniques for accelerating NAND flash interfaces.

2.3 Related Work
2.3.1 Hiding the Latency of NAND Flash Memory
The effect of channel striping and way interleaving was extensively studied in [15], which used a 2-channel, 4-way-interleaving interface scheme with a software architecture adopting a hybrid-mapping algorithm. The proposed system outperformed the compared HDD by 77%. The improvement was mainly due to the increased parallelism and the interleaved accesses when programming NAND flash memory. However, the limitations of this approach include area overhead and complicated controller design due to the increased number of channels. Other approaches to latency hiding include the techniques proposed in [16], [17], where DRAM was used as the cache buffer for NAND flash memory. When a cache hit occurs, the data access time is solely determined by the DRAM access time, which is much smaller than the flash access time.

2.3.2 Optimizing the Firmware of the SSD Controller
The techniques in this category aim at enhancing SSD performance by reducing the data transfer size, the operating time, and the number of extra operations
required for wear leveling. The technique presented in [18], [19], [20] compresses the data from the host unit to save storage space in NAND flash memory and to reduce the data transfer time from the controller to the flash chips. However, this method may incur extra time and area overheads for data (de)compression. The hybrid-mapping technique proposed in [9] aimed at improving the write speed by introducing two types of logical blocks called data blocks and log blocks. The number of log blocks is much smaller than that of data blocks, and data is always written to log blocks first. When all log blocks are used up, the FTL moves the data from log blocks to data blocks. This technique may incur extra computation overhead but can be beneficial for quick search owing to the small number of log blocks. The techniques introduced in [10], [11], [21], [22] can reduce the number of erase operations by using a page-map cache and smart mapping strategies; it was shown that the system performance can be enhanced by reducing the number of erase and garbage collection operations.

2.3.3 Improving Controller-Flash Interface
In [23], the authors introduced a synchronous NAND flash interface using a signal called data valid strobe (DVS). This interface improved robustness to process, voltage, and temperature (PVT) variations as well as the read performance by isolating the timing of the controller from that of the NAND flash memory. However, this approach exploited only one edge of each clock signal, producing limited performance improvements. The focus of that work was more on desensitizing PVT variations than on boosting read and write performance.

Recently, some leading companies in the SSD business organized an initiative called the open NAND flash interface (ONFI) and proposed a DDR flash interface scheme, whose specification is available at [24]. Additionally, the authors in [25] proposed a similar concept along with a new SSD architecture. However, these approaches require additional pins, thus causing compatibility issues and area overhead. Furthermore, no quantitative analysis was performed to prove the effectiveness of these approaches and to show the impact of DDR interface schemes on SSD performance.

Our work presented in this paper belongs to the category of techniques that accelerate the interface between the SSD controller and the NAND flash chips. Unlike the aforementioned approaches, our DDR synchronous interface scheme provides pin-level compatibility with the traditional NAND flash memory interface. Moreover, we evaluate the effect of the proposed technique quantitatively with respect to various architectural choices (e.g. the number of channels and ways) from the SSD perspective.

3 CONVENTIONAL ASYNCHRONOUS NAND FLASH INTERFACE FOR SOLID-STATE DISKS
The overall structure of a typical SSD was explained in Section 2. In this section, we present additional details on the conventional method for interfacing the controller and the NAND flash chips in SSDs. The material in this section is crucial for understanding the new interface architecture proposed in Section 4. The major difference between the two architectures lies in the controller-flash interface; the conventional interface uses an asynchronous single-data-rate scheme, whereas the proposed design utilizes a synchronous double-data-rate scheme.

3.1 Block Diagram and Key Components
Fig. 3 shows the conventional asynchronous interface architecture. Note that only the NAND_IF block is drawn inside the controller block for clarity, although there exist additional blocks, as shown in Fig. 1. The NAND_IF block and the NAND flash chip communicate over three types of ports. The upper two ports are for transferring data strobe signals, and the lower one is for exchanging all the other control signals as well as data.

Inside the NAND_IF block, there are two blocks called generate write (Gen_W) and generate read (Gen_R). The signal that controls writes is called write enable bar (WEB) and is generated by the Gen_W block. The read control signal is named read enable bar (REB) and is produced by the Gen_R block. WEB and REB are sent to the NAND flash chip via the upper two ports of the interface. The D_CON block is to delay the clock (CLK) so that data transfers at the interface can fulfill any given timing specifications. The blocks called WFIFO and RFIFO are for buffering data from and to the host, respectively. The IO latches inside the flash chip include timing-critical parts called the write latch (WLAT) and the read latch (RLAT). WLAT temporarily stores the data from the controller to the page register, whereas RLAT temporarily stores the data from the page register to the controller.

3.2 Timing Parameters
To explain the write and read operations of the SSD interface architecture in Sections 3.3 and 3.4, we first show in Table 1 a number of important timing parameters for the interface building blocks. In the table, note that the first eight parameters are common to the conventional and the proposed interfaces. The next four are only for the conventional architecture; the rest are only for the proposed architecture detailed in Section 4. Additional timing parameters of NAND flash chips themselves are available in [26], [27], [28].

3.3 Write Operation and Timing
Fig. 4(a) shows the write timing diagram of the conventional NAND flash memory interface. The controller asserts WEB and issues the first write command (CMD) to the flash chip in order to initiate a write operation.
CHUNG et al.: A HIGH-PERFORMANCE SOLID-STATE DISK WITH DOUBLE-DATA-RATE NAND FLASH MEMORY
Fig. 3: Block diagram of the NAND flash memory interface in the conventional SSD architecture.
TABLE 1: Timing parameters for the conventional and proposed interface architectures.

Common to both architectures (Figs. 3 and 5):
  tP         Clock (CLK) period
  tD         Delay of CLK introduced by D_CON (i.e. the difference between CLK and DCLK); tD = α · tP, where 0 ≤ α ≤ 1/2
  tS/tH      Setup/hold time of WFIFO and RFIFO
  tR         Data fetch time (from Cell Array to Page Register)
  tPROG      Program time (from Page Register to Cell Array)
  tBYTE      Data transfer time between Page Register and WLAT/RLAT
  tWC        Write cycle time (i.e. one cycle of WEB)
  tRC        Read cycle time (i.e. one cycle of REB)

Conventional architecture only (Fig. 3):
  tIN        Data propagation time between the IO pad of the controller and WFIFO/RFIFO
  tOUT       Signal propagation time from the FFs of the controller to the strobe pads of the NAND flash memory
  tDS/tDH    Setup/hold time of IO signals with respect to WEB
  tREA       Data transfer time from RLAT to the IO pad of the controller

Proposed architecture only (Fig. 5):
  tDIFF      Difference between the arrival time of DVS at RFIFO and the arrival time of IO from the NAND flash at RFIFO
  tDLL       Time delay by the DLL, as defined in Eq. (2)
  tRWEBD     Propagation delay of RWEB from the strobe port of the NAND flash memory to the DLL
  tIOS/tIOH  Setup/hold time of IO signals with respect to DVS
  tIOD       Data propagation delay from RLAT to the IO pad of the NAND flash memory
  tRWC       One cycle of RWEB; replaces tRC and tWC
The destination addresses are then sent to the flash chip, followed by a series of data to be written to the page register through WLAT at every tWC, the period of WEB. Finally, the controller issues a program CMD to transfer the data in the page register to the cell arrays of the flash chip. During the program phase, the flash memory chip enters the busy state and cannot be interrupted until the end of the program phase. This time duration is defined as tPROG and is normally very long.

Note that, in the write mode, both control (i.e. WEB) and data are concurrently transferred from the controller to the flash chip; the delays of the control and data paths are almost identical. The conventional interface operates synchronously in the write mode in the sense that transfers are synchronized to the periodic WEB signal under the timing constraints set by tDS and tDH. The data transfer rate in the write mode can therefore be improved by increasing the frequency of WEB. However, the conventional interface is not considered synchronous due to the asynchronous read mode, as will be explained next.
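The dominance of tPROG over the byte-by-byte transfer phase can be seen with a quick back-of-the-envelope model. This is only an illustrative sketch: the 25-ns write cycle, 200-µs program time, and 4-KB page below are generic assumed values, not the parameters of the interfaces studied in this paper.

```python
def page_write_time(page_bytes, t_wc, t_prog):
    # One byte is latched into WLAT per WEB cycle (IO[7:0] is 8 bits wide),
    # after which the whole page is programmed to the cell array in a
    # single t_prog phase during which the chip is busy.
    return page_bytes * t_wc + t_prog

transfer = 4096 * 25e-9                        # time spent toggling WEB
total = page_write_time(4096, 25e-9, 200e-6)   # transfer + program phase
print(f"transfer fraction of a page write: {transfer / total:.0%}")
```

With these assumed numbers the WEB transfer phase is only about a third of the page-write time, which is why speeding up the interface alone yields limited write gains for a single flash chip.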
Fig. 4: The timing diagrams of the conventional asynchronous NAND flash memory interface.
3.4 Read Operation and Timing
The timing diagrams for the read operation are shown in Fig. 4(b). After issuing the first read CMD followed by the destination address, the second read CMD is issued to the flash chip. It then enters the busy state for fetching data from the cell arrays to the page register. This data fetching time is defined as tR, which is much shorter than tPROG. Thus, the data transfer time between the cell arrays and the page register is not as critical in the read mode as it was in the write mode. At the completion of the fetch, the flash chip enters the ready state, and the controller periodically asserts REB to the flash chip with the period of tRC. For each REB cycle, the control logic inside the flash chip instructs a single data transfer from the page register to RLAT within tBYTE, and the data reach the IO ports of the controller within tREA. The controller then fetches the data into RFIFO at the positive edge of DCLK, a version of CLK delayed by tD. More precisely, tD is defined as

tD = α · tP, (1)

where 0 ≤ α ≤ 1/2. Note that DCLK is used to satisfy the setup time constraint imposed on RFIFO. Without DCLK, the system may easily violate the timing constraint due to the variations of tIN, tOUT, and tREA. Thus, each operation of propagating REB and fetching data is allowed to take at most tRC + tD, instead of tRC.

It is critical to notice the following: in the read mode of the conventional interface, the control (i.e. REB) and data cannot be propagated concurrently, unlike the write mode. That is, REB is first propagated from the controller to the flash chip, and then the data transfer occurs in the opposite direction. Consequently, a single read cycle should be determined by the sum of the propagation delays of REB and data, unlike the write mode in which a write cycle can be set by the maximum of the two delays. For this reason, tRC is normally longer than tWC, although the specification of commercial NAND flash memory usually lists identical timing parameters for convenience. The new interface architecture proposed in the next section focuses on reducing the read cycle time in order to enhance read performance.

4 PROPOSED DDR SYNCHRONOUS NAND FLASH INTERFACE FOR SOLID-STATE DISKS
In this section, we provide the details of the proposed NAND flash interface for improving SSD performance. This new architecture focuses on enhancing the data throughput between the NAND flash memory chips and the SSD controller. To this end, the proposed scheme operates in a synchronous manner for both read and write modes and supports double-data-rate transfers. As highlighted in Section 3, a major performance bottleneck in the conventional NAND flash memory is the serialized, opposite-directional propagation of control and data in the read mode. The proposed interface breaks this serialized propagation path into two smaller ones, one for control and the other for data, and isolates them from the perspective of timing. More precisely, the REB control is generated by CLK and is propagated just as in the conventional architecture. On the other hand, the data is fetched from the flash chip to the controller
in synchronization with a new control signal named data valid strobe (DVS), as depicted in Fig. 5. DVS is a data strobe asserted by the flash chip and can be considered as a data clock whose edges indicate stable points for data fetching.

Fig. 5: Block diagram of the proposed double-data-rate synchronous NAND flash memory interface.

Introducing DVS is for the synchronous read operation. To support DDR operation, we duplicate the RFIFO and WFIFO buffers inside the controller and the RLAT and WLAT latches inside the flash chip. In the controller, one pair of RFIFO and WFIFO is dedicated to the rising edge of CLK, and the other pair to the falling edge of CLK; in the flash chip, one pair of RLAT and WLAT is for the rising edge of DVS, and the other pair for the falling edge of DVS.

The notion of DVS was first introduced in [23], but the purpose of that work was not to increase the data bandwidth but to desensitize the PVT variations, as discussed in Section 2.3. In contrast to [23], the proposed design can enhance the overall read/write performance of an SSD by allowing double-data-rate transfers between the controller and the flash memory. We compare the performance of the interface introduced in [23] and that of the proposed architecture in Section 5.

The proposed scheme differs from the popular DDR DRAM interface in that the proposed architecture does not require an additional memory clock, since REB is replaced by the bidirectional DVS signal. Replacing REB by DVS, rather than adding an extra pin, is beneficial for maintaining backward compatibility with conventional components and boards.

Note that in the proposed architecture we rename WEB as RWEB, since it is used for both read and write modes.

4.1 Proposed Interface Architecture
Fig. 5 shows the block diagram of the proposed DDR synchronous NAND flash memory interface. As stated earlier, REB has been replaced by DVS for synchronous operations, and the FIFOs and latches have been duplicated for DDR operations. The multiplexers are used inside the NAND flash chip in order to select WLAT for writes and RLAT for reads, depending on the edge type of RWEB. Now that RWEB is commonly used for both read and write modes, we do not need to distinguish tWC and tRC and thus use tRWC as the common timing parameter representing both. The D_CON and Gen_R blocks are not required in the proposed interface design but are included in the design shown in Fig. 5 for guaranteeing backward compatibility.

Note that the timing-critical path in the read mode is broken into two parts in the proposed design. One is the path for propagating RWEB, and the other is the data path from the NAND flash memory to the controller. The delay of the first path determines tRWC, since RWEB propagates through the same path in the write mode. Thus, tRWC is identical to tWC, rather than tRC of the conventional NAND flash memory. The delay of the data path in the proposed architecture is shorter than tRC of the conventional architecture. This is because the propagation delay of RWEB does not need to be considered when calculating the data propagation delay. Consequently, the proposed interface can provide higher data throughput than the conventional one can.

To generate DVS at a stable data point, we use a delay-locked loop (DLL) circuit. The DLL is triggered by the data from RLAT and generates DVS by delaying RWEB to satisfy the setup time (tIOS) and hold time (tIOH) constraints at the input of the controller. We define the time delay by the DLL as tDLL, which is given by

tDLL = tIOD,max − tRWEBD,min + tIOS, (2)

where tRWEBD is the propagation delay of RWEB from the input port of the NAND flash memory to the DLL, and tIOD is the data propagation delay from RLAT to
the IO pads of the NAND flash memory. Note that the small variation in data availability can easily be adjusted by the DLL block.

4.2 Write/Read Operation and Timing
Fig. 6 shows the write and read timing diagrams of the proposed DDR synchronous NAND flash interface. In the proposed interface, data is transferred at both the rising and falling edges of the RWEB signal in the write mode, as represented in Fig. 6(a). The data transfer rate can thus be improved by a factor of two compared with the conventional design. In the read mode shown in Fig. 6(b), the controller asserts RWEB, instead of REB, to the NAND flash memory tR after the second CMD has been issued. At the same time, the first data is prefetched to RLAT from the page register. The data are then moved from RLAT to the IO ports and to the DLL block, which delays RWEB by tDLL for DVS generation. Finally, the controller fetches the data at the falling edge of DVS. For the next series of data, DVS is generated in a similar manner, and the controller fetches data at both edges of DVS.

The major difference of the proposed design with respect to the conventional one is the concurrent propagation of control signals and data. Hence, it is possible for the proposed scheme to have a shorter read cycle than the conventional design.

4.3 Determining Operating Clock Period
To compare the proposed and the conventional architectures in terms of their maximum operating frequency, we calculate the minimum period of the system clock (i.e. tP,min) for each architecture.

4.3.1 Conventional Interface
By design, tP should be at least the larger of tRC and tWC, which are the periods of REB and WEB, respectively. From Section 3.4, recall that tRC > tWC, since the propagation of REB and data should be serialized and happen within the same cycle in the read mode. Thus, we can ignore tWC when computing tP,min.

To determine tP,min, we also need to consider tBYTE, since the data transfer between the page register and RLAT occurs in a distinct clock cycle that precedes the REB and data propagation. If this tBYTE parameter is greater than tRC, tP,min should be determined by tBYTE. Consequently, tP,min is given by

tP,min = max{tRC, tBYTE}. (3)

In the read mode, the serialized propagation of REB and data must complete within tRC + tD (see Section 3.4), so that

tRC = tOUT + (tREA + tIN + tS) − tD. (4)

Plugging Eq. (4) into Eq. (3) gives

tP,min = max{tOUT + (tREA + tIN + tS) − tD, tBYTE}, (5)

which further develops to

tP,min = max{(tOUT + tREA + tIN + tS)/(1 + α), tBYTE}, (6)

by applying Eq. (1) to Eq. (5). The maximum clock frequency of the conventional design can then be determined by Eq. (6).

4.3.2 Proposed Interface
For the proposed architecture, the value of tP should be at least the larger of tRWC and tBYTE, namely

tP,min = max{tRWC, tBYTE}, (7)

since tRWC plays the role of tRC.

Recall that the parameters tIOS and tIOH represent the setup and hold time constraints of data with respect to DVS at the IO pad of the controller, respectively. By design, tRWC is identical to the period of DVS, which should be at least twice the sum of tIOS and tIOH, as shown in Fig. 7(a). In other words, tP,min of the proposed architecture is given by

tP,min = max{(tIOS + tIOH) × 2, tBYTE}, (8)

where the term (tIOS + tIOH) is doubled since the proposed design supports DDR, and a single DVS cycle should thus be long enough to manage two transfers.

The architecture shown in Fig. 5 assumes that the controller and the NAND flash memory chips are integrated into a single board. Thus, tIOS and tIOH are affected by the geometric parameters of the board-level interconnects. When the board-level design parameters are available, we can derive an alternative representation of tP,min given by

tP,min = max{(tS + tH + tDIFF) × 2, tBYTE}, (9)

where tS and tH are the setup and hold times of RFIFO, respectively, and tDIFF is the difference between the arrival time of DVS at RFIFO and the arrival time of IO from the NAND flash memory at RFIFO. As informally shown in Fig. 7(b), tDIFF is caused by the different interconnect delays of DVS and IO at the board level. In Eq. (9), note that tS and tH are independent of the geometric parameters of the board and that tDIFF also becomes a constant once the geometric parameters of the interconnects at the board have been decided.

The maximum clock frequency of the proposed design can be determined from either Eq. (8) or Eq. (9).
Fig. 6: Timing diagrams of the proposed DDR synchronous NAND flash interface.
TABLE 2: NAND flash memory timing parameter values important performance metrics for comparing different
used in the experiments. SSDs, and ii) energy consumption.
Parameters Conventional (ns) Proposed (ns) Throughout the two sets of experiments detailed in
Sections 5.3.1 and 5.3.2, we wanted to see how the
tOU T 7.82 N/A
proposed architecture can guide the design decisions
tIN 1.65 N/A
about the internal channel architecture; this is critical
tS 0.25 0.25
since it can trade-off between the area and performance
tH 0.02 0.02
of the SSD under design.
tDIF F N/A 4.69
Three different interface designs were implemented
tREA 20 N/A
and compared: the conventional asynchronous interface
tBY T E 12 12
outlined in Section 3, the synchronous (but not double-
data-rate) interface proposed in [23] and the proposed
synchronous double-data-rate interface explained in Sec-
the performance gap between the proposed and the
tion 4. In this section, these designs are referred to as
conventional architectures would become wider.
CONV, SYNC ONLY and PROPOSED, respectively.
For the workload used in the experiments, we used
For convenience in implementation, the SYNC ONLY
widely used sequential traces that consist of 64-KB
architecture was not developed from the scratch but was
read/write data chunks [30]. The sequential traces repre-
derived from PROPOSED by replacing DDR transfers
sent the typical access patterns happening when a large
with single-data-rate transfers. The operating frequency
volume of data is written to or read from a storage based
of SYNC ONLY was thus set to 83 MHz.
on NAND flash memory. As host interface, the SATA
interface1 was used. Finally, the overall SSD system was 5.3.1 Architectures with Different Way Interleaving
modeled at behavior level, and all the aforementioned
We designed single-channel SSDs with five different
models were integrated using MentorGraphics Seam-
degrees of way interleaving: 1-way, 2-way, 4-way, 8-
less [31].
way and 16-way. The write and read performance of
each design was then measured for the three competing
5.2 Operating Frequency Determination interfaces and the two flash cell types, as shown in
Using the simulators we developed, the major timing Fig. 8 and Table 3. The experimental results we obtained
parameters of the proposed and the conventional interface architectures were measured, as listed in Table 2. The value of tDIFF was measured using CubicWare [32], [33]; the difference between the loading capacitances of DVS and IO at the board was set to 30 pF. The values of tS and tH are identical for both architectures since they were synthesized with the same library. Note that only the first five parameters in the table were obtained from measurements; the rest are from the specifications of the NAND flash chips [26], [27], [28].

For the conventional SSD, the minimum data access period tP,min defined in Eq. (6) can be evaluated as tP,min = max{(7.82 + 20 + 1.65 + 0.25)/(1 + 0.5), 12} = 19.81 nanoseconds (ns) with the value of α = 0.5. Based on this, the maximum data access rate of the conventional design was set to 50 MHz. For the proposed design, Eq. (9) is evaluated as tP,min = max{0.25 + 0.2 + 4.69, 12} = 12 ns, and the maximum data access rate of the proposed design was set to 83 MHz.

5.3 SSD-Level Performance Analysis

We compared and contrasted the performance of the SSDs designed with the proposed synchronous DDR interface with that of the SSDs using the conventional interface. The comparison criteria used were i) the write and read speeds, which have become one of the most …

… clearly indicate that the proposed design greatly improves the system performance in cooperation with the way-interleaving technique, as detailed below.

• Case I (write, SLC): We first consider the SLC cases shown in Fig. 8(a). For the 1-way design, the write performance of CONV and PROPOSED is similar, the latter being better by only 9%. This marginal improvement originates from the fact that the data transfer time from the SSD controller to the NAND flash memory is much smaller than the cell program time tPROG of the NAND flash memory. What PROPOSED reduces is the data transfer time, rather than tPROG; by Amdahl's law, the impact of reducing the data transfer time on the overall performance is therefore diminished by the dominant size of tPROG.

However, as the degree of way interleaving is increased, the advantage of using PROPOSED becomes more evident. For CONV, the performance gain from way interleaving decreases as the number of ways increases, eventually saturating at the 8-way design. In contrast, for PROPOSED, the interleaving effect was maintained throughout all degrees of way interleaving. Note that CONV achieved only about a 5x performance gain as the number of ways changed from 1 to 16, whereas the gain of PROPOSED was more than 11x under the same condition. For the 16-way design, PROPOSED outperformed CONV by 2.45 times. This difference is caused by the fact that PROPOSED enables the controller to put more data into the flash chips in a fixed amount of time (i.e., tPROG) than CONV.

1. We used SATA2, or "SATA 3 Gbit/s," which supports a bandwidth of up to 300 MB/s.
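The two timing evaluations above can be reproduced with a short script. This is an illustrative sketch: only the numeric values and the max{...} structure of Eqs. (6) and (9) come from the text, and the argument names below (terms_ns, scale, floor_ns) are placeholders rather than the paper's notation.

```python
def min_access_period(terms_ns, scale=1.0, floor_ns=12.0):
    """Shared form of Eqs. (6) and (9): the minimum data access period
    is the larger of a (scaled) sum of signal-timing terms and a floor
    given by the minimum cycle time of the NAND interface (12 ns)."""
    return max(sum(terms_ns) / scale, floor_ns)

# Conventional interface, Eq. (6), with alpha = 0.5 (scale = 1 + alpha):
t_conv = min_access_period([7.82, 20.0, 1.65, 0.25], scale=1.5)

# Proposed interface, Eq. (9): the timing sum is below the 12-ns floor.
t_prop = min_access_period([0.25, 0.2, 4.69])

f_conv = 1e3 / t_conv  # ~50.5 MHz; the paper rounds down to 50 MHz
f_prop = 1e3 / t_prop  # ~83.3 MHz; rounded down to 83 MHz

print(round(t_conv, 2), round(t_prop, 2))  # 19.81 12.0
```

The proposed design hits the 12-ns cycle-time floor rather than being limited by signal timing, which is exactly why it sustains the higher 83-MHz access rate.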
CHUNG et al.: A HIGH-PERFORMANCE SOLID-STATE DISK WITH DOUBLE-DATA-RATE NAND FLASH MEMORY
TABLE 3: Write/read speed (MB/s) of single-channel SSDs with 1- to 16-way interleaving. The last two columns give the speed of PROPOSED relative to SYNC_ONLY and to CONV, respectively.

            Ways   CONV   SYNC_ONLY   PROPOSED   vs. SYNC_ONLY   vs. CONV
SLC  Write     8   39.78     55.36       63.00        1.14          1.58
              16   39.76     60.44       97.35        1.61          2.45
           Mean‡   26.29     34.53       44.13        1.16          1.42
     Read      1   27.78     36.66       47.89        1.31          1.72
               2   42.78     67.16       70.47        1.05          1.65
               4   42.75     67.13      117.68        1.75          2.75
               8   42.72     67.11      117.64        1.75          2.75
              16   42.69     67.11      117.59        1.75          2.75
            Mean   39.74     61.03       94.25        1.49          2.26
MLC  Write     1    4.43      4.55        4.65        1.02          1.05
               2    8.36      8.85        9.24        1.04          1.11
               4   15.24     16.75       18.13        1.08          1.19
               8   25.86     29.72       34.08        1.15          1.32
              16   32.45     45.99       57.23        1.24          1.76
            Mean   17.27     21.17       24.67        1.11          1.26
     Read      1   26.04     33.58       42.69        1.27          1.64
Fig. 8: Write/read speed of single-channel SSDs designed with different degrees of way interleaving (see Table 3 for more details). (a) Single-level cell. (b) Multi-level cell.

The performance of SYNC_ONLY lay between those of CONV and PROPOSED, as expected from the fact that SYNC_ONLY does not support double-data-rate data transfers.

• Case II (read, SLC): This case is shown on the right-hand side of Fig. 8(a). The overall performance of reading was higher than that of writing for all three interfaces tested. By design, the way-interleaving technique can be fully effective during tR in the read mode, while it does not fully utilize tPROG in the write mode. Even in this case, the way-interleaving technique is more effective for PROPOSED, since the performance of PROPOSED saturates at a larger degree of way interleaving than that of CONV; namely, PROPOSED and CONV saturate at the 4-way and 2-way designs, respectively. The relative performance of PROPOSED over CONV in the read mode was also higher than that in the write mode for all degrees of way interleaving. For instance, PROPOSED outperformed CONV by a factor of 2.75 for the 16-way design.

• Case III (write/read, MLC): Fig. 8(b) shows the results for the MLC NAND flash memory design. The read time (tR) and the program time (tPROG) of MLC devices are much larger than those of SLC devices. Thus, for the same degree of way interleaving, the effect of way interleaving on the overall performance decreases in MLC devices. This reduction in the effectiveness of way interleaving is larger in the write mode than in the read mode, since tPROG is much larger than tR. This result indicates that the proposed interface combined with the interleaving technique can be more effective for high-capacity storage devices that are composed of many MLC chips than for low-capacity storages. We can also deduce that the proposed design is more advantageous for storage devices with many low-density MLC chips than for storages with a small number of high-density MLC chips.

5.3.2 Architectures with Various Channel Configurations

In practice, the capacity of a storage system is typically determined earlier than micro-architectural design parameters such as the number of ways and channels. Given a capacity value, we can explore the various combinations of ways and channels to search for an optimal design. In this regard, we tested three different SSD architectures with varying channel/way configurations.
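The saturation behaviour described in Cases I–III can be illustrated with a simple single-channel pipeline model. This is only a sketch under assumed numbers (the page size, transfer time, and tPROG below are placeholders, not the measured parameters of Table 2): with N-way interleaving, a round transfers N pages over the shared channel while each chip programs in parallel, so bandwidth stops scaling once N times the transfer time exceeds tPROG, and halving the transfer time, as the DDR interface does, roughly doubles the number of ways at which saturation occurs.

```python
def write_bandwidth(n_ways, page_kb, t_xfer_us, t_prog_us):
    """Steady-state write bandwidth (MB/s) of one channel with n-way
    interleaving. A round issues n page transfers on the shared channel
    and overlaps them with cell programming, so the round time is
    bounded by both the channel occupancy and the program time."""
    round_us = max(n_ways * t_xfer_us, t_xfer_us + t_prog_us)
    return n_ways * (page_kb / 1024.0) / (round_us / 1e6)

# Assumed numbers: 2-KB pages, t_prog = 200 us, and a DDR transfer
# twice as fast as the conventional one (20 us vs. 40 us per page).
for n in (1, 2, 4, 8, 16):
    conv = write_bandwidth(n, page_kb=2, t_xfer_us=40, t_prog_us=200)
    prop = write_bandwidth(n, page_kb=2, t_xfer_us=20, t_prog_us=200)
    print(n, round(conv, 1), round(prop, 1))
```

With these toy values the slow interface saturates near 6 ways (240/40) while the fast one keeps scaling to about 11 ways (220/20), mirroring the observation that CONV flattens by the 8-way design while PROPOSED maintains its interleaving gain up to 16 ways.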
[Figure: write/read speed (MB/s) of SSDs with three channel/way configurations (1CH 16W, 2CH 8W, 4CH 4W); legend: CONV, SYNC_ONLY, PROPOSED.]

TABLE 4: Write/read speed (MB/s) of SSDs with different channel/way configurations (C-W denotes C channels × W ways per channel). The last two columns give the speed of PROPOSED relative to SYNC_ONLY and to CONV, respectively.

          Config    CONV    SYNC_ONLY   PROPOSED   vs. SYNC_ONLY   vs. CONV
SLC  Write   4-4   103.76     115.68      123.52        1.07          1.19
           Mean‡    72.53      92.70      111.90        1.25          1.65
     Read   1-16    42.69      67.11      117.59        1.75          2.75
             2-8    81.44     126.70      224.82        1.77          2.76
             4-4   155.35     237.61      max§           –             –
            Mean    93.16     143.81      235.25        1.76          2.76
MLC  Write  1-16    32.45      45.99       57.23        1.24          1.76
             2-8    48.72      56.83       64.75        1.14          1.33
             4-4    57.46      63.55       68.49        1.08          1.19
            Mean    46.21      55.46       63.49        1.15          1.41
     Read   1-16    41.50      64.73      110.52        1.71          2.66
             2-8    79.32     122.48      201.42        1.64          2.54
             4-4   150.94     230.17      max            –             –
            Mean    90.59     139.13      217.18        1.68          2.60

‡ … columns 7–8.
§ Reached the maximum bandwidth of the SATA interface.
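The effect of the SATA cap marked "max" above can be sketched with a first-order estimate that scales the single-channel figures of Table 3 by the channel count. The per-channel values below are the PROPOSED (SLC, read) entries from Table 3; the linear-scaling assumption and the helper name `estimate` are ours, which is why the 2-8 estimate slightly overshoots the measured 224.82 MB/s (the model ignores controller overhead).

```python
# PROPOSED single-channel read bandwidth (SLC) from Table 3, in MB/s,
# indexed by the number of ways per channel.
TABLE3_READ_MBPS = {4: 117.68, 8: 117.64, 16: 117.59}

SATA2_LIMIT_MBPS = 300.0  # "SATA 3 Gbit/s" host-interface cap

def estimate(channels, ways):
    """First-order estimate: channels scale bandwidth linearly until
    the host interface saturates (the 'max' entries in the table)."""
    return min(channels * TABLE3_READ_MBPS[ways], SATA2_LIMIT_MBPS)

for ch, ways in [(1, 16), (2, 8), (4, 4)]:
    print((ch, ways), round(estimate(ch, ways), 2))
```

The 4CH/4W estimate (4 × 117.68 = 470.72 MB/s) is clipped to 300 MB/s, reproducing the host-interface saturation that the table reports for the 4-4 read configurations.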
[7] T. Hara, K. Fukuda, K. Kanazawa, N. Shibata, K. Hosono, H. Maejima, M. Nakagawa, T. Abe, M. Kojima, M. Fujiu et al., "A 146-mm² 8-Gb multi-level NAND flash memory with 70-nm CMOS technology," IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 161–169, 2006.
[8] K. Takeuchi, Y. Kameda, S. Fujimura, H. Otake, K. Hosono, H. Shiga, Y. Watanabe, and T. Futatsuyama, "A 56-nm CMOS 99-mm² 8-Gb multi-level NAND flash memory with 10-MB/s program throughput," IEEE Journal of Solid-State Circuits, vol. 42, no. 1, p. 219, 2007.
[9] J. Kim, J. Kim, S. Noh, S. Min, and Y. Cho, "A space-efficient flash translation layer for CompactFlash systems," IEEE Transactions on Consumer Electronics, vol. 48, no. 2, pp. 366–375, 2002.
[10] S. Kim and S. Jung, "A log-based flash translation layer for large NAND flash memory," in Proceedings of the 8th International Conference on Advanced Communication Technology (ICACT 2006), 2006, pp. 1641–1644.
[11] C. Wu and T. Kuo, "An adaptive two-level management for the flash translation layer in embedded systems," in Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design, 2006, pp. 601–606.
[12] M. Assar, S. Nemazie, P. Estakhri et al., "Flash memory mass storage architecture incorporation wear leveling technique," United States Patent 5,479,638, Dec. 26, 1995.
[13] http://www.serialata.org.
[14] D. Kim, K. Bang, S. Ha, C. Park, S. Chung, and E. Chung, "Solid-state disk with double data rate DRAM interface for high-performance PCs," IEICE Transactions on Information and Systems, vol. E92-D, no. 4, pp. 727–731, 2009.
[15] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi, "A high performance controller for NAND flash-based solid state disk (NSSD)," in Proceedings of the 21st IEEE Non-Volatile Semiconductor Memory Workshop (NVSMW), 2006, pp. 17–20.
[16] D. Ryu, "Solid state disk controller apparatus," United States Patent Application 11/311,990, Dec. 19, 2005.
[17] J. Lee and D. Ryu, "Semiconductor solid state disk controller," United States Patent Application 11/594,893, Nov. 9, 2006.
[18] K. Yim, H. Bahn, and K. Koh, "A flash compression layer for SmartMedia card systems," IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192–197, 2004.
[19] W. Huang, C. Chen, Y. Chen, and C. Chen, "A compression layer for NAND type flash memory systems," in Proceedings of the Third International Conference on Information Technology and Applications (ICITA 2005), vol. 1, 2005.
[20] W. Huang, C. Chen, and C. Chen, "The real-time compression layer for flash memory in mobile multimedia devices," in Proceedings of the International Conference on Multimedia and Ubiquitous Engineering (MUE'07), 2007, pp. 171–176.
[21] L. Chang and T. Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," in Proceedings of the Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, 2002, pp. 187–196.
[22] S. Lim and K. Park, "An efficient NAND flash file system for flash memory storage," IEEE Transactions on Computers, pp. 906–912, 2006.
[23] C. Son, S. Yoon, S. Chung, C. Park, and E. Chung, "Variability-insensitive scheme for NAND flash memory interfaces," Electronics Letters, vol. 42, no. 23, pp. 1335–1336, 2006.
[24] http://www.onfi.org.
[25] R. Schuetz, H. Oh, J. Kim, H. Pyeon, S. Przybylski, and P. Gillingham, "HyperLink NAND flash architecture for mass storage applications," in Proceedings of the IEEE Non-Volatile Semiconductor Memory Workshop, 2007, pp. 3–4.
[26] K9F1G08U0B 128M x 8-bit NAND Flash Memory Data Sheet V1.0, Samsung Electronics Company, 2006.
[27] K9GAG08U0M 2G x 8-bit NAND Flash Memory Data Sheet V1.0, Samsung Electronics Company, 2006.
[28] FK8G16Q2M 2Gb MuxOneNAND M-die Data Sheet V1.1, Samsung Electronics Company, 2007.
[29] http://www.synopsys.com/Tools/Implementation/SignOff/Pages/PrimeTime.aspx.
[30] MultiMediaCard System Specification Version 4.2, MMCA MultiMediaCard Association, 2006.
[31] http://www.mentor.com/products/fv/seamless/.
[32] M. Jang, H. Jin, B. Lee, J. Lee, S. Song, T. Kim, and J. Kong, "CubicWare: a hierarchical design system for deep submicron ASIC," in Proceedings of the Twelfth Annual IEEE International ASIC/SOC Conference, 1999, pp. 168–172.
[33] http://www.samsung.com/global/business/semiconductor/products/asic/Products_EDASupport.html.