QUANTIZING TIME SERIES FOR EFFICIENT SIMILARITY SEARCH
UNDER TIME WARPING

Inés F. Vega-López                          Bongki Moon
School of Informatics                       Department of Computer Science
Autonomous University of Sinaloa            University of Arizona
Culiacán, Sinaloa, México                   Tucson, AZ, USA
email: [email protected]                email: [email protected]
ABSTRACT

Indexing time series data is an interesting problem that has attracted much interest in the research community over the last decade. Traditional indexing methods organize the data space using different metrics. For time series, however, there are cases when a metric is not suited for properly assessing the similarity between sequences. For instance, to detect similarities between sequences that are locally out of phase, Dynamic Time Warping (DTW) must be used. DTW is not a metric, as it does not satisfy the triangular inequality. Therefore, traditional spatial access methods cannot be used without introducing false dismissals. In such cases, alternative methods for organizing and searching time series data must be proposed. In this paper we propose the use of quantization to generate small and homogeneous representations of time series. We compute upper and lower bounds on the DTW distance to a query sequence using this quantized representation to filter out sequences that cannot be a best match for the query. In the proposed approach, efficient search is achieved by organizing the quantized representations of the data in a linear array that can be efficiently read from disk. The computational cost of processing the query is shadowed by the IO cost required to scan the file containing the linear array, and it does not affect the total query cost.

KEY WORDS
Time Series, Time Warping, Similarity Search.

1 Introduction

A time series is a potentially long sequence of values, each of which represents the measurement of an attribute of interest at a point in time. Due to the time-varying nature of the universe, there are countless examples of time series data from diverse sources and applications, such as stock prices, currency exchange rates, electrocardiograms, environmental data, and gene expression measurements. With the growing popularity of time series data, there is an increasing demand to support fast retrieval of time series data based on similarity measurements. Similarity search, or query by content, on time series data is important to many application domains such as information retrieval, data mining, and clustering. Such queries are used when applications need to compare the temporal evolution of data to a particular pattern known as the query pattern. For example, a fund manager of a stock brokerage firm may be interested in finding all stocks whose prices moved similarly to that of a particular stock, or followed a certain pattern (e.g., head-and-shoulders).

To efficiently evaluate similarity search queries, it is important to organize time series data in such a way that relevant time series can be retrieved without looking up the entire database exhaustively. It is easy to propose a simple technique for evaluating similarity search queries in which a time series of length n is mapped into a point in n-dimensional space, and a spatial access method such as the R-tree [2, 10] is used to index it. However, it is not uncommon that time series data are sampled during a relatively long period of time. Thus, a direct application of spatial access methods would require mapping time series data into a very high dimensional vector space. In consequence, this approach would likely suffer performance degradation due to the reduced pruning power of the index, a phenomenon known as the curse of dimensionality.

To address the problem of indexing large multimedia objects, the general approach is to apply a transformation that produces an indexable representation of the multimedia object [8]. This representation is usually a vector that conveys relevant information about the main characteristics of the object, and it is usually known as a feature or signature vector. Feature extraction from time series data has been known as dimensionality reduction because it generally represents a time series with a small number of values. Many promising techniques have been proposed to generate a small representation of time series data for improving similarity search performance. Among these techniques, we can cite Discrete Fourier Transformation [1], Discrete Wavelet Transformation [4, 21, 25], Singular Value Decomposition [5], Segmentation [26, 14, 17], and Quantization [24, 23]. Most of these techniques organize and search time series data using the Euclidean distance as the metric for similarity.

There are some cases where the Euclidean distance, and Lp norms in general, may not be entirely adequate for estimating similarity. The reason is that Lp norms are sensitive to distortions in the time axis. To avoid this problem, similarity models should allow some elastic shifting of the time dimension to detect similar shapes that are locally out of phase [6]. This is the case of Dynamic Time Warping (DTW), introduced to the context of time series by Berndt and Clifford [3]. DTW is used when the sizes of two time series are different or when it is required to match sequences that are out of phase.

In this paper we propose the use of quantization for efficient similarity search of time series under time warping. We recognize that this problem is heavily bound by IO operations and address this issue in a twofold manner. First, we minimize the time spent accessing the index at search time. Second, we reduce the number of data time series that must be fetched to guarantee an exact answer to the query. To address the latter, we propose the use of quantization. Quantization provides a compact and homogeneous approximation of time series that is more accurate than other approximations. To minimize the index access
time, we organize the quantized representations of the data time series in a linear index. This allows us to take advantage of fast sequential disk accesses and, as a consequence, reduce the time spent during index search.

The rest of the paper is organized as follows. In the next section we provide an overview of the most influential work on processing similarity search queries under time warping. In Section 3 we describe our proposal for efficiently processing this class of queries. Section 4 describes the experimental evaluation of the proposed technique. Finally, in Section 5 we summarize the benefits of our proposal.

2 Previous Work

Using Dynamic Time Warping for measuring similarity between sequences is problematic for two reasons. First, DTW has a high computational cost: comparing two time series of length n and m takes O(nm) time. Second, DTW is not a metric, as it does not satisfy the triangular inequality. Therefore, spatial access methods cannot be used without introducing false dismissals [27]. These two issues have been studied and different solutions have been proposed by the research community. In this section we provide a summary of the most influential work addressing these problems.

To reduce its high computational cost, Yi et al. [27] proposed the use of FastMap [9] in combination with a lower-bounding distance function to DTW based on the minimum and maximum values of a sequence. FastMap is used to reduce the dimensionality of the objects being compared and, in consequence, the computational cost of DTW. The lower-bounding distance is used to prune away non-qualifying elements. However, because the Dynamic Time Warping distance does not satisfy the triangle inequality, the use of FastMap might result in a number of false dismissals.

The problem of the high cost of DTW was also addressed by Keogh et al. [16]. They introduced the Piecewise Dynamic Time Warping (PDTW) distance, which approximates the dynamic time warping distance using only the Piecewise Aggregate Approximation (PAA) [14] representation of the time series being compared. Because it only provides an approximation to the actual DTW distance, this approach introduces false dismissals. In addition, it is sensitive to the arguments provided by the user regarding the compression ratio of the data representation. That is, the number of resulting false dismissals is sensitive to the number of segments used for the PAA representation (PAA is a lossy compression technique; the approximation error increases as we reduce the number of segments). A similar approach was presented by Chan et al. [5] using the Haar Wavelet Transformation as their data transformation. The model they proposed is called Low Resolution Time Warping, and it provides an approximation of the DTW distance between the two sequences being compared. Because it only approximates DTW, this model cannot guarantee no false dismissals either.

Park et al. [20] proposed an indexing technique for similarity search on time series data based on the time warping distance. Their approach is based on categorizing the values found in a time series to create a string. The strings so obtained are inserted into a disk-based suffix tree for indexing and searching.

Keogh [13] observed that the suffix tree used by Park et al. can be orders of magnitude larger than the original data and proposed an indexing technique based on a restricted time warping distance. He showed that, by limiting the warping path, an index-based approach can be used to answer dynamic time warping queries without introducing false dismissals. Essentially, a constrained warping path creates an envelope around the query time series. This envelope can be approximated using a modified Piecewise Aggregate Approximation (PAA) which fully contains every point in the envelope (i.e., using the MIN and MAX aggregate functions instead of AVG). This approximation to the envelope can then be used to define a distance function that lower-bounds the dynamic time warping distance. The concept of the envelope was later improved by Zhu and Shasha [28]. They proved that the approximation to the envelope does not need to contain every point in the envelope, but only every approximated point in the envelope. This property is termed container-invariant. A container-invariant transformation of an envelope guarantees no false dismissals during dynamic time warping queries [28].

3 A New Technique for Efficient Similarity Search under Time Warping

Our proposal for the efficient processing of similarity search on time series data under time warping is twofold. First, we propose to use quantization to generate a small and accurate approximation of time series data. Previous research work has shown that quantization provides a better approximation to time series under a variety of scenarios than other dimensionality reduction techniques [18]. Second, indexing large vectors is problematic due to the curse of dimensionality. In particular, the performance of a hierarchical index degrades as we increase the dimensionality of the search space. To reduce the time spent accessing the index, we favor the use of a linear index over a hierarchical index.

3.1 Quantizing Time Series

To quantize time series data, we use the Self COntained Bit Encoding (SCoBE) introduced by Lopez and Moon, as it produces better approximations than other quantization techniques when applied to time series data [18]. SCoBE approximates a time series by first segmenting it into disjoint subsequences. For each subsequence, the minimum and maximum observed values are recorded. Once the subsequences have been defined(1), we quantize the values in each segment with respect to its range. For this, we partition the range into 2^b cells, where b is a user-defined number of bits, and assign a bit encoding to each of these cells. Each value in the segment is then approximated by the bit string (i.e., between 0 and 2^b - 1) representing the cell where the value falls.

When a data object is quantized, its values are no longer represented by a point in space but by a bounding region. Using this region, it is easy to define both upper- and lower-bound distances to a query sequence for any Lp norm. Given a query time series Q and a data object S, both of length n, and the bounding region R_S corresponding to the quantized approximation of S, we define a lower-bound distance LB(Q, R_S) and an upper-bound distance UB(Q, R_S) in Definitions 1 and 2 below.

(1) Lopez and Moon proposed several segmentation schemes to reduce the approximation error. We do not discuss such schemes here, as they are not the focus of this paper.
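The segment-and-encode step described above can be sketched as follows. This is a simplified illustration using fixed-length segments, not the actual SCoBE encoding (whose segmentation schemes are given in [18]); the function names and parameter values are ours.

```python
import numpy as np

def quantize(series, seg_len, bits):
    """Split the series into disjoint fixed-length segments, record each
    segment's (min, max), and encode every value by the index of the
    2**bits cell it falls into within its segment's range."""
    cells = 2 ** bits
    # series length is assumed to be a multiple of seg_len
    x = np.asarray(series, dtype=float).reshape(-1, seg_len)
    mins, maxs = x.min(axis=1), x.max(axis=1)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard flat segments
    codes = ((x - mins[:, None]) / span[:, None] * cells).astype(int)
    return mins, maxs, np.clip(codes, 0, cells - 1)

def region(mins, maxs, codes, bits):
    """Per-value bounding interval [lo, hi] implied by each cell code.
    The true value is guaranteed to lie inside its cell."""
    cells = 2 ** bits
    span = np.where(maxs > mins, maxs - mins, 1.0)[:, None]
    lo = mins[:, None] + codes * span / cells
    hi = mins[:, None] + (codes + 1) * span / cells
    return lo.ravel(), hi.ravel()
```

With b = 4 bits, every value is replaced by a 4-bit cell index plus the per-segment (min, max) pair, and the cell edges give the bounding region used by the bound distances below.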
Definition 1 (Lower-bound distance)

    LB(Q, R_S) = sqrt( sum_{i=1..n} d_i^2 ),            (1)

    where  d_i = q_i - t_i   if q_i > t_i,
           d_i = b_i - q_i   if q_i < b_i,
           d_i = 0           otherwise,

and t_i and b_i represent the top and bottom edges of the bounding region defined by the quantized representation of S.

Definition 2 (Upper-bound distance)

    UB(Q, R_S) = sqrt( sum_{i=1..n} max{ (q_i - b_i)^2, (t_i - q_i)^2 } ).    (2)

A lower-bound distance guarantees no false dismissals when evaluating a similarity search query [8]. An upper-bound distance, on the other hand, is not necessary for finding correct search results. However, it can be used to further reduce the search space.

3.2 k-NN Search under Time Warping

Keogh [13] proved that the dynamic time warping distance between two time series can be exactly indexed (i.e., without introducing false dismissals) by constraining its warping path. That is, when comparing two time series, the displacement that a particular value in the series can have along the time dimension is restricted. For a query time series Q, this constrained path creates an envelope around it. Let us denote the envelope around Q by Env(Q). Keogh showed that the Euclidean distance from a time series S to Env(Q) is a lower bound on the constrained DTW distance between Q and S. Because this lower-bound distance guarantees no false dismissals, the envelope created by the constrained warping path on a query sequence can be used to answer exact similarity queries under time warping.

In this work, we extended Keogh's envelope idea for exact similarity search under time warping. Keogh organizes the data time series by inserting their PAA approximations into a hierarchical index. To compute a lower bound on the DTW distance, he must apply the same approximation to the query envelope. As a consequence of approximating the envelope, the quality of the lower-bounding distance deteriorates and the number of false alarms increases. To keep the number of false alarms down to a minimum, we propose to use the query envelope without approximating it. The raw envelope is compatible with the quantized approximation of the data time series, and a lower-bounding distance can be directly computed. To achieve this, we use the Euclidean distance between the raw envelope and the quantized representation of the data time series. Equations 1 and 2 can be easily modified to compute a lower bound on the Euclidean distance when the query object is a region (i.e., the query envelope) instead of a point.

To efficiently process the search, we avoid the use of hierarchical indexes, as they are prone to suffer from the curse of dimensionality. While an ideal hierarchical index would take O(log N) time to find the first nearest neighbor in a database with N objects, organizing a high dimensional space severely affects index performance. In fact, previous research work indicates that for high dimensional data, searching in a hierarchical index reduces to a linear search [24]. That is, a large portion of the index needs to be accessed during search. Because the index pages are fetched randomly from disk, the query performance suffers due to the large IO cost. Instead of using a hierarchical index, we organize the quantized representations of the data time series sequentially in a file. During search, we sequentially scan this file. Reading a file sequentially is more efficient because of pre-fetching. In addition, the computational cost of processing the query can be masked by the IO cost thanks to the use of direct memory access.

A two-step algorithm was presented by Weber et al. [24] for k-NN search on the VA-File (a linear index similar to ours). We use a modified version of this algorithm for exact k-NN search under time warping. We use Weber's algorithm to build a list of candidates ranked by their lower-bound distance to the envelope around the query object. The data of all objects in the candidate list with a lower-bound distance smaller than the actual k-NN distance are fetched from disk. These data are used to compute the DTW distance from the object to the query. If this distance is smaller than the k-NN distance, the corresponding object is included in the answer to the query. If, by inserting a new element into the answer set, its size grows larger than k, one of them is removed from the set. Only the k closest objects are kept in the answer set.

4 Performance Evaluation

In this section, we empirically demonstrate the performance benefits of using quantization organized in a linear index for similarity search under time warping. In our experiments, we used 5 data sets from different sources. Three different performance metrics were used to evaluate the performance of the different techniques presented in this experimental study.

4.1 Data Sets

Our test data sets were obtained from sources in different fields such as medicine, finance, and astrophysics. We did this in an effort to minimize data bias(2) in our experiments. Four data sets were from real-world data sources and one was synthetically generated. From each data source we generated a data set containing 100,000 time series objects of length 256. We randomly extracted 1,000 entries from each data set. This subset of 1,000 time series objects became our query set. We did this to avoid exact matches with queries in our experiments. In addition, this practice allowed us to use query objects that are in the same domain as the data set. Details of our testing data are given next.

(1) Plasma: We obtained this data from the Coordinated Heliospheric Observations Web server (COHOWeb) [7]. COHOWeb provides access to hourly resolution magnetic field and plasma data from 12 different heliospheric spacecraft. In our experiments, we used the plasma temperatures reported by these spacecraft since 1963.

(2) Data bias is the conscious or unconscious use of a particular set of testing data to confirm a desired finding [15].
(2) EEG: This data arose from a large study to examine electroencephalogram correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on subjects' scalps, which were sampled at 256 Hz for 1 second [11].

(3) ECG: This data set was generated from electrocardiogram data obtained from the MIT-BIH database distribution [19].

(4) Mixed Bag: This data set was generated from seven different data sets. Since the source data sets have different properties [17], a preprocessing step was performed to normalize the data. The resulting data set has a mean of zero and a standard deviation of one.

(5) FinTime: Using the financial time series benchmark [12], we synthetically generated stock values for 100,000 companies. While the synthetic data generator provides stock values at opening and closing times, we only used the closing time values in our experiments.

4.2 Performance Metrics

In our experiments, we evaluated the efficiency of different techniques using three metrics. We measured the index search overhead and the number of data objects fetched as the main factors affecting the overall performance of similarity search. We also measured elapsed time as the performance metric directly perceived by the user.

(1) Index search overhead: the number of disk pages that must be read from the index when processing a similarity search query.

(2) Data objects fetched: It has been suggested that we should prevent performance bias due to disparities in the quality of implementation of the methods being compared [15]. Following this recommendation, we evaluate the effectiveness of different k-NN search methods by the number of data objects fetched from the database. The number of data objects fetched during a k-NN query is a performance metric that is independent of the quality of implementation.

(3) Elapsed time: We used wall-clock time to measure the elapsed time during the evaluation of k-NN queries. This time includes both CPU and IO time. In addition, for each query we recorded the time spent on CPU operations only. This allows us to estimate the time spent in IO operations. Finally, to avoid any caching effects on the index and data files, we flushed the main memory between consecutive queries by loading irrelevant data into the system.

For each metric used in our experiments, we present the average value after executing 100 k-NN queries.

4.3 Evaluated Techniques

We have evaluated DTW queries using quantization and compared our results to those obtained using Keogh's envelope. We also include Zhu and Shasha's improved envelope in our empirical evaluation. In our experiments, we created an envelope around the query time series by limiting the warping path using the Sakoe-Chiba band [22]. When comparing two time series, this band restricts the displacement that a particular value in the series can have along the time dimension. Given two time series Q and S, this band indicates that the value of Q at time i can only be compared to the values of S at times i - d to i + d, where d is the time displacement allowed by the band. In our experiments, we set the Sakoe-Chiba band to allow the time series to stretch by a fixed fraction of their length along the time line.

(1) Keogh's Envelope: We implemented Keogh's envelope as indicated in [13]. We applied the Piecewise Aggregate Approximation (PAA) to every entry in the data set and inserted the corresponding data approximation into a disk-resident R-tree.

(2) Zhu's Improved Envelope: We implemented the container-invariant envelope as described by Zhu and Shasha in [28]. We applied PAA to every envelope and inserted the corresponding entry into a disk-resident R-tree.

(3) Quantization: We generated an envelope for the query sequence and used the SCoBE approximation on all the data time series to obtain a compact and accurate representation. While any quantization such as VA can be used, previous research work proved that this quantization works better for similarity search on time series [18]. The data approximations are laid out in a linear index and processed sequentially during search.

All indexes were built using pages of 8 KBytes. We used 32-bit words for storing the values of time series data. The experiments were performed on Intel Pentium computers with 600 MHz processors and the Linux operating system. Each computer has 128 MBytes of main memory and 9 GBytes of disk storage connected through a SCSI interface.

For Keogh's and Zhu's envelopes, we built indexes using feature vectors of 8, 16, and 32 dimensions (i.e., the values of segments in the PAA transformation of the envelope). There is a performance trade-off depending on the size of the feature vector. On one hand, a higher-dimensional vector provides a more accurate representation, reducing the data access overhead. At the same time, it degrades the performance of the hierarchical index by increasing the number of index pages that must be accessed (i.e., the curse of dimensionality). The overall search performance depends largely on these two factors. By varying the size of the feature vector, we were able to identify the best case for each data set. The results shown in this section represent the best observed performance for Keogh's envelope and Zhu's improved envelope.

The quantizing approximations do not reduce dimensionality explicitly. Instead, the size of the data representation is reduced by reducing the resolution of each value (i.e., quantizing it). In our experiments, we used approximations one eighth of the size of the data objects. Note that this achieves the same compression ratio as using 32 segments in the PAA representation.

4.4 Experimental Results

In our experiments, we extensively evaluated the performance of the techniques described in Section 4.3. We measured the performance of these techniques on different data sets. The number of objects in each data set remained fixed at 99,000 entries.
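The quantization-based search evaluated here (raw query envelope, region-based lower bound, and Weber-style two-step refinement, Sections 3.1 and 3.2) can be sketched as follows. This is a simplified in-memory sketch, not the disk-based implementation used in the experiments; the helper names, the toy per-value quantization grid, and all parameter values are illustrative.

```python
import numpy as np

def sakoe_chiba_envelope(q, delta):
    """Raw (unapproximated) envelope around query q: at position i, the
    running min/max of q over the band [i - delta, i + delta]."""
    n = len(q)
    lo = np.array([q[max(0, i - delta): i + delta + 1].min() for i in range(n)])
    hi = np.array([q[max(0, i - delta): i + delta + 1].max() for i in range(n)])
    return lo, hi

def lb_envelope_region(env_lo, env_hi, reg_lo, reg_hi):
    """Squared lower-bound distance between the query envelope and the
    bounding region of a quantized series: per position, the gap between
    the two intervals (zero when they overlap)."""
    gap = np.maximum(np.maximum(env_lo - reg_hi, reg_lo - env_hi), 0.0)
    return float(np.sum(gap ** 2))

def constrained_dtw(q, s, delta):
    """Exact DTW (sum of squared differences) under a Sakoe-Chiba band."""
    n, m = len(q), len(s)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - delta), min(m, i + delta) + 1):
            D[i, j] = (q[i - 1] - s[j - 1]) ** 2 + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def knn_search(query, regions, data, k, delta):
    """Two-step k-NN in the style of Weber et al. [24]: rank candidates by
    their lower bound from a scan of the quantized representations, then
    fetch raw series in that order and refine with the exact DTW."""
    env_lo, env_hi = sakoe_chiba_envelope(query, delta)
    ranked = sorted((lb_envelope_region(env_lo, env_hi, lo, hi), i)
                    for i, (lo, hi) in enumerate(regions))
    best = []  # (distance, index) pairs, kept sorted, at most k entries
    for lb, i in ranked:
        if len(best) == k and lb >= best[-1][0]:
            break  # no remaining candidate can beat the current k-NN distance
        best = sorted(best + [(constrained_dtw(query, data[i], delta), i)])[:k]
    return best
```

Because every true value lies inside its bounding region, the lower bound never exceeds the true constrained DTW distance, so the early-termination test above dismisses no qualifying series and the answer matches a brute-force DTW scan.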
Index Search Overhead (pages)
Data set     Envelope   Improved E.   Quantized
Plasma          4,185         4,073       3,094
EEG             4,637         4,524       3,094
ECG               414           280       3,094
Mixed Bag       1,329           648       3,094
FinTime           639           244       3,094
(a) Number of Disk Pages Fetched from the Index

Data Objects Fetched
Data set     Envelope   Improved E.   Quantized
Plasma         29,928        25,338      22,119
EEG            23,965        19,000      16,369
ECG             1,063           468         188
Mixed Bag       4,358         4,080       3,580
FinTime           275           209         119
(b) Number of Data Objects Fetched

Table 1. IO Performance for 10-NN Queries under Time Warping. Results are Averages over 100 Queries. Index Pages for the Quantized Approach are Read Sequentially.

Elapsed Time (seconds)
Data set     Envelope   Improved E.   Quantized
Plasma        124.323       107.423      56.492
EEG           152.377       123.649      66.638
ECG             6.404         3.617       2.422
Mixed Bag      21.919        17.714      11.036
FinTime         6.126         3.341       2.147

Table 2. Elapsed Time for 10-NN Dynamic Time Warping Queries. Results are Averages over 100 Queries.

4.4.1 Index Search Overhead

In Table 1(a), we present our observations on the index search overhead for all the techniques used in this study. This table shows the number of IO operations on the index while processing the query. We observe that using a hierarchical index can have a negative effect on search performance. In particular, a large number of index pages are read by the envelope and improved envelope approaches for the Plasma and EEG data sets. In addition, we should remember that a hierarchical index is accessed using random IO. Therefore, if a large portion of the hierarchical index is accessed during similarity search, the performance benefits of having an index structure are nullified due to the high cost of random IO operations. On the other hand, a linear index (e.g., the index used with quantization in our approach) is accessed sequentially, and this access pattern takes advantage of fast sequential IO operations. Therefore, despite the fact that more index pages are read by our approach for the ECG, Mixed Bag, and FinTime data sets, the overall query performance is not significantly affected and our proposal is still faster than the other methods.

4.4.2 Data Objects Fetched

The quality of an approximation can be measured by the number of data objects fetched during search. Tight bounding distances provided by the approximation will have a positive effect on reducing the data search space. In Table 1(b), we present the number of objects that each technique fetched during a k-NN query. This table shows the results on all data sets. Note that the number of data pages accessed during search can be estimated from the number of objects fetched. One 8-KByte page stores four data time series of length 256. Therefore, the number of data pages read is given by the following expression.

    pages read = ceil( objects fetched / 4 )

From Table 1(b), we can observe that our proposed approach consistently fetched the smallest number of time series objects on all data sets.

4.4.3 Elapsed Time

To this point, we have presented performance metrics aimed at understanding the behavior of the different similarity search techniques used in this experimental study. In this section, we conclude our experimental evaluation by presenting the performance as perceived by the user, showing the elapsed time measured by wall clock. We loaded irrelevant data into the system to flush the entire memory between consecutive queries and avoid caching effects. Our results are summarized in Table 2.

Table 2 shows that quantization consistently outperformed the state-of-the-art techniques for DTW queries. This was the result of combining efficient accesses to the index (which is read sequentially) with an approximation that required fetching a smaller number of data objects than the other methods.

5 Conclusions

In this paper, we have presented an approach for efficiently processing similarity search on time series data under time warping. Our approach is based on quantization to generate a small and accurate approximation of time series data. We have shown that these approximations, organized in a linear index, outperform sophisticated transformations and hierarchical indexes during exact k-NN DTW search. We have provided experimental evidence showing that our proposed technique provides consistently good performance under a variety of settings.

The effectiveness of our approach for similarity search under time warping comes from minimizing the index search overhead as well as from reducing the number of data objects fetched. By organizing the quantized approximations in a linear index, we can take advantage of fast sequential disk accesses. In consequence, we are able to reduce the time spent during index search. In addition, quantization provides tight upper- and lower-bound distances to the data objects. This allows us to filter out a large number of irrelevant entries and to drastically reduce the data search space.

References

[1] Rakesh Agrawal, Christos Faloutsos, and Arun N. Swami. Efficient Similarity Search in Sequence Databases. In Proceedings of the FODO Conference, pages 69–84, Evanston, IL, October 1993.
[2] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the ACM-SIGMOD Conference, pages 322–331, Atlantic City, NJ, May 1990.

[3] Donald J. Berndt and James Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pages 359–370, Seattle, WA, July 1994.

[4] Kin-Pong Chan and Ada Wai-Chee Fu. Efficient Time Series Matching by Wavelets. In Proceedings of the International Conference on Data Engineering, pages 126–133, Sydney, Australia, March 1999. IEEE Computer Society.

[5] Kin-Pong Chan, Ada Wai-Chee Fu, and Clement Yu. Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping. IEEE Transactions on Knowledge and Data Engineering, 15(3):686–705, May–June 2003.

[6] Selina Chu, Eamonn Keogh, David Hart, and Michael Pazzani. Iterative Deepening Dynamic Time Warping for Time Series. In Proceedings of the 2nd SIAM International Conference on Data Mining, Arlington, VA, April 2002.

[7] COHOWeb. Deep Space Hourly Merged Magnetic Field, Plasma, and Ephemerides Data. http://nssdc.gsfc.nasa.gov/cohoweb/cw.html, October 2002.

[8] Christos Faloutsos. Searching Multimedia Databases By Content. Kluwer Academic Publishers, Boston, MA, 1996.

[9] Christos Faloutsos and King-Ip Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proceedings of the ACM-SIGMOD Conference, pages 163–174, San Jose, CA, May 1995.

[10] Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the ACM-SIGMOD Conference, pages 47–57, Boston, MA, June 1984.

[11] S. Hettich and S. D. Bay. The UCI KDD Archive. http://kdd.ics.uci.edu, 2002.

[12] Kaippallimalil J. Jacob and Dennis Shasha. FinTime – A Financial Time Series Benchmark. http://cs.nyu.edu/cs/faculty/shasha/fintime.html, March 2000.

[13] Eamonn Keogh. Exact Indexing of Dynamic Time Warping. In Proceedings of the VLDB Conference, pages 406–417, Hong Kong, China, August 2002.

[14] Eamonn Keogh, Kaushik Chakrabarti, Michael Pazzani, and Sharad Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems, 3(3):263–286, 2000.

[15] Eamonn Keogh and Shruti Kasetty. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In Proceedings of the ACM-SIGKDD Conference, pages 102–111, Edmonton, Alberta, Canada, July 2002.

[16] Eamonn Keogh and Michael Pazzani. Scaling up Dynamic Time Warping for Datamining Applications. In Proceedings of the ACM-SIGKDD Conference, pages 285–289, Boston, MA, August 2000.

[17] Eamonn Keogh, Kaushik Chakrabarti, Sharad Mehrotra, and Michael Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In Proceedings of the ACM-SIGMOD Conference, pages 151–162, Santa Barbara, CA, May 2001.

[18] Inés F. Vega-López and Bongki Moon. A Quantization Approach for Efficient Similarity Search on Time Series Data. In Proceedings of the International Conference on Internet Information Retrieval, pages 182–189, Goyang, Korea, November 2004.

[19] George B. Moody. MIT-BIH Database Distribution. http://ecg.mit.edu/index.html, 1999.

[20] Sanghyun Park, Wesley W. Chu, Jeehee Yoon, and Chihcheng Hsu. Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. In Proceedings of the International Conference on Data Engineering, pages 23–32, San Diego, CA, February–March 2000. IEEE Computer Society.

[21] Ivan Popivanov. Efficient Similarity Queries over Time Series Data Using Wavelets. Master's thesis, University of Toronto, Toronto, Canada, 2001.

[22] Hiroaki Sakoe and Seibi Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, February 1978.

[23] Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima. The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. In Proceedings of the VLDB Conference, pages 516–526, Cairo, Egypt, September 2000.

[24] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In Proceedings of the VLDB Conference, pages 194–205, New York, NY, August 1998.

[25] Yi-Leh Wu, Divyakant Agrawal, and Amr El Abbadi. A Comparison of DFT and DWT Based Similarity Search in Time-Series Databases. In Proceedings of the ACM-CIKM Conference, pages 488–495, McLean, VA, November 2000.

[26] Byoung-Kee Yi and Christos Faloutsos. Fast Time Sequence Indexing for Arbitrary Lp Norms. In Proceedings of the VLDB Conference, pages 385–394, Cairo, Egypt, September 2000.

[27] Byoung-Kee Yi, H. V. Jagadish, and Christos Faloutsos. Efficient Retrieval of Similar Time Sequences Under Time Warping. In Proceedings of the International Conference on Data Engineering, pages 201–208, Orlando, FL, February 1998. IEEE Computer Society.

[28] Yunyue Zhu and Dennis Shasha. Warping Indexes With Envelope Transforms for Query by Humming. In Proceedings of the ACM-SIGMOD Conference, pages 181–192, San Diego, CA, June 2003.