QUANTIZING TIME SERIES FOR EFFICIENT SIMILARITY SEARCH
UNDER TIME WARPING

Inés F. Vega-López                          Bongki Moon
School of Informatics                       Department of Computer Science
Autonomous University of Sinaloa            University of Arizona
Culiacán, Sinaloa, México                   Tucson, AZ, USA
email: [email protected]                email: [email protected]
ABSTRACT

Indexing time series data is an interesting problem that has attracted much interest in the research community over the last decade. Traditional indexing methods organize the data space using different metrics. For time series, however, there are cases when a metric is not suited for properly assessing the similarity between sequences. For instance, to detect similarities between sequences that are locally out of phase, Dynamic Time Warping (DTW) must be used. DTW is not a metric, as it does not satisfy the triangular inequality. Therefore, traditional spatial access methods cannot be used without introducing false dismissals. In such cases, alternative methods for organizing and searching time series data must be proposed. In this paper we propose the use of quantization to generate small and homogeneous representations of time series. We compute upper and lower bounds on the DTW distance to a query sequence using this quantized representation to filter out sequences that cannot be a best match for the query. In the proposed approach, efficient search is achieved by organizing the quantized representations of the data in a linear array that can be efficiently read from disk. The computational cost of processing the query is shadowed by the IO cost required to scan the file containing the linear array, and it does not affect the total query cost.

KEY WORDS
Time Series, Time Warping, Similarity Search.

1 Introduction

A time series is a potentially long sequence of values, each of which represents the measurement of an attribute of interest at a point in time. Due to the time-varying nature of the universe, there are countless examples of time series data from diverse sources and applications, such as stock prices, currency exchange rates, electrocardiograms, environmental data, and gene expression measurements. With the growing popularity of time series data, there is an increasing demand to support fast retrieval of time series data based on similarity measurements. Similarity search, or query by content, on time series data is important to many application domains such as information retrieval, data mining, and clustering. Such queries are used when applications need to compare the temporal evolution of data to a particular pattern known as the query pattern. For example, a fund manager of a stock brokerage firm may be interested in finding all stocks whose prices moved similarly to that of a particular stock, or followed a certain pattern (e.g., head-and-shoulders).

To efficiently evaluate similarity search queries, it is important to organize time series data in such a way that relevant time series can be retrieved without looking up the entire database exhaustively. It is easy to propose a simple technique for evaluating similarity search queries in which a time series of length n is mapped into a point in n-dimensional space, and a spatial access method such as the R-tree [2, 10] is used to index it. However, it is not uncommon that time series data are sampled during a relatively long period of time. Thus, a direct application of spatial access methods would require mapping time series data into a very high dimensional vector space. In consequence, this approach would likely suffer performance degradation due to the reduced pruning power of the index, a phenomenon known as the curse of dimensionality.

To address the problem of indexing large multimedia objects, the general approach is to apply a transformation that produces an indexable representation of the multimedia object [8]. This representation is usually a vector that conveys relevant information about the main characteristics of the object, and it is usually known as a feature or signature vector. Feature extraction from time series data has been known as dimensionality reduction because it generally represents a time series with a small number of values. Many promising techniques have been proposed to generate a small representation of time series data for improving similarity search performance. Among these techniques, we can cite Discrete Fourier Transformation [1], Discrete Wavelet Transformation [4, 21, 25], Singular Value Decomposition [5], Segmentation [26, 14, 17], and Quantization [24, 23]. Most of these techniques organize and search time series data using the Euclidean distance as the metric for similarity.

There are some cases where the Euclidean distance, and Lp norms in general, may not be entirely adequate for estimating similarity. The reason is that Lp norms are sensitive to distortions in the time axis. To avoid this problem, similarity models should allow some elastic shifting of the time dimension to detect similar shapes that are locally out of phase [6]. This is the case of Dynamic Time Warping (DTW), introduced to the context of time series by Berndt and Clifford [3]. DTW is used when the sizes of two time series are different or when it is required to match sequences that are out of phase.

In this paper we propose the use of quantization for efficient similarity search of time series under time warping. We recognize that this problem is heavily bound by IO operations and address this issue in a twofold manner. First, we minimize the time spent accessing the index at search time. Second, we reduce the number of data time series that must be fetched to guarantee an exact answer to the query. To address the latter, we propose the use of quantization. Quantization provides a compact and homogeneous approximation of time series that is more accurate than other approximations. To minimize the index access
time, we organize the quantized representations of the data time series in a linear index. This allows us to take advantage of fast sequential disk accesses and, as a consequence, reduce the time spent during index search.

The rest of the paper is organized as follows. In the next section we provide an overview of the most influential work on processing similarity search queries under time warping. In Section 3 we describe our proposal for efficiently processing this class of queries. Section 4 describes the experimental evaluation of the proposed technique. Finally, in Section 5 we summarize the benefits of our proposal.

2 Previous Work

Using Dynamic Time Warping for measuring similarity between sequences is problematic for two reasons. First, DTW has a high computational cost: comparing two time series of length n and m takes O(nm) time. Second, DTW is not a metric, as it does not satisfy the triangular inequality. Therefore, spatial access methods cannot be used without introducing false dismissals [27]. These two issues have been studied and different solutions have been proposed by the research community. In this section we provide a summary of the most influential work addressing these problems.

To reduce its high computational cost, Yi et al. [27] proposed the use of FastMap [9] in combination with a lower-bounding distance function to DTW based on the minimum and maximum values of a sequence. FastMap is used to reduce the dimensionality of the objects being compared and, in consequence, the computational cost of DTW. The lower-bounding distance is used to prune away non-qualifying elements. However, because the Dynamic Time Warping distance does not satisfy the triangle inequality, the use of FastMap might result in a number of false dismissals.

The problem of the high cost of DTW was also addressed by Keogh et al. [16]. They introduced the Piecewise Dynamic Time Warping (PDTW) distance, which approximates the dynamic time warping distance using only the Piecewise Aggregate Approximation (PAA) [14] representation of the time series being compared. Because it only provides an approximation to the actual DTW distance, this approach introduces false dismissals. In addition, it is sensitive to the arguments provided by the user regarding the compression ratio of the data representation. That is, the number of resulting false dismissals is sensitive to the number of segments used for the PAA representation (PAA is a lossy compression technique; the approximation error increases as we reduce the number of segments). A similar approach was presented by Chan et al. [5] using the Haar Wavelet Transformation as their data transformation. The model they proposed is called Low Resolution Time Warping, and it provides an approximation of the DTW distance between the two sequences being compared. Because it only approximates DTW, this model cannot guarantee no false dismissals either.

Park et al. [20] proposed an indexing technique for similarity search on time series data based on the time warping distance. Their approach is based on categorizing the values found in a time series to create a string. The strings so obtained are inserted into a disk-based suffix tree for indexing and searching.

Keogh [13] observed that the suffix tree used by Park et al. can be orders of magnitude larger than the original data and proposed an indexing technique based on a restricted time warping distance. He showed that, by limiting the warping path, an index-based approach can be used to answer dynamic time warping queries without introducing false dismissals. Essentially, a constrained warping path creates an envelope around the query time series. This envelope can be approximated using a modified Piecewise Aggregate Approximation (PAA) which fully contains every point in the envelope (i.e., using the MIN and MAX aggregate functions instead of AVG). This approximation to the envelope can then be used to define a distance function that lower-bounds the dynamic time warping distance. The concept of the envelope was later improved by Zhu and Shasha [28]. They proved that the approximation to the envelope does not need to contain every point in the envelope, but only every approximated point in the envelope. This property is termed container-invariant. A container-invariant transformation of an envelope guarantees no false dismissals during dynamic time warping queries [28].

3 A New Technique for Efficient Similarity Search under Time Warping

Our proposal for the efficient processing of similarity search on time series data under time warping is twofold. First, we propose to use quantization to generate a small and accurate approximation of time series data. Previous research work has shown that quantization provides a better approximation to time series under a variety of scenarios than other dimensionality reduction techniques [18]. Second, indexing large vectors is problematic due to the curse of dimensionality. In particular, the performance of a hierarchical index degrades as we increase the dimensionality of the search space. To reduce the time spent accessing the index, we favor the use of a linear index over a hierarchical index.

3.1 Quantizing Time Series

To quantize time series data, we use the Self COntained Bit Encoding (SCoBE) introduced by Lopez and Moon, as it produces better approximations than other quantization techniques when applied to time series data [18]. SCoBE approximates a time series by first segmenting it into disjoint subsequences. For each subsequence, the minimum and maximum observed values are recorded. Once the subsequences have been defined(1), we quantize the values in each segment with respect to its range. For this, we partition the range into 2^b cells, where b is a user-defined number of bits, and assign a bit encoding to each of these cells. Each value in the segment is then approximated by the bit string (i.e., between 0 and 2^b - 1) representing the cell where the value falls.

When a data object is quantized, its values are no longer represented by a point in space but by a bounding region. Using this region, it is easy to define both upper- and lower-bound distances to a query sequence for any Lp norm. Given a query time series Q and a data object S, both of length n, and the bounding region R_S corresponding to the quantized approximation of S, we define a lower-bound distance LB(Q, R_S) and an upper-bound distance UB(Q, R_S) in Definitions 1 and 2 below.

(1) Lopez and Moon proposed several segmentation schemes to reduce the approximation error. We do not discuss such schemes here, as they are not the focus of this paper.
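The segment-and-encode step described above can be sketched as follows. This is a simplified illustration using fixed-length segments, not the actual SCoBE encoding (whose segmentation schemes are given in [18]); the function names and parameter values are ours.

```python
import numpy as np

def quantize(series, seg_len, bits):
    """Split the series into disjoint fixed-length segments, record each
    segment's (min, max), and encode every value by the index of the
    2**bits cell it falls into within its segment's range."""
    cells = 2 ** bits
    # series length is assumed to be a multiple of seg_len
    x = np.asarray(series, dtype=float).reshape(-1, seg_len)
    mins, maxs = x.min(axis=1), x.max(axis=1)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard flat segments
    codes = ((x - mins[:, None]) / span[:, None] * cells).astype(int)
    return mins, maxs, np.clip(codes, 0, cells - 1)

def region(mins, maxs, codes, bits):
    """Per-value bounding interval [lo, hi] implied by each cell code.
    The true value is guaranteed to lie inside its cell."""
    cells = 2 ** bits
    span = np.where(maxs > mins, maxs - mins, 1.0)[:, None]
    lo = mins[:, None] + codes * span / cells
    hi = mins[:, None] + (codes + 1) * span / cells
    return lo.ravel(), hi.ravel()
```

With b = 4 bits, every value is replaced by a 4-bit cell index plus the per-segment (min, max) pair, and the cell edges give the bounding region used by the bound distances below.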
Definition 1 (Lower-bound distance)

    LB(Q, R_S) = sqrt( sum_{i=1..n} d_i^2 ),            (1)

    where  d_i = q_i - t_i   if q_i > t_i,
           d_i = b_i - q_i   if q_i < b_i,
           d_i = 0           otherwise,

and t_i and b_i represent the top and bottom edges of the bounding region defined by the quantized representation of S.

Definition 2 (Upper-bound distance)

    UB(Q, R_S) = sqrt( sum_{i=1..n} max{ (q_i - b_i)^2, (t_i - q_i)^2 } ).    (2)

A lower-bound distance guarantees no false dismissals when evaluating a similarity search query [8]. An upper-bound distance, on the other hand, is not necessary for finding correct search results. However, it can be used to further reduce the search space.

3.2 k-NN Search under Time Warping

Keogh [13] proved that the dynamic time warping distance between two time series can be exactly indexed (i.e., without introducing false dismissals) by constraining its warping path. That is, when comparing two time series, the displacement that a particular value in the series can have along the time dimension is restricted. For a query time series Q, this constrained path creates an envelope around it. Let us denote the envelope around Q by Env(Q). Keogh showed that the Euclidean distance from a time series S to Env(Q) is a lower bound on the constrained DTW distance between Q and S. Because this lower-bound distance guarantees no false dismissals, the envelope created by the constrained warping path on a query sequence can be used to answer exact similarity queries under time warping.

In this work, we extended Keogh's envelope idea for exact similarity search under time warping. Keogh organizes the data time series by inserting their PAA approximations into a hierarchical index. To compute a lower bound on the DTW distance, he must apply the same approximation to the query envelope. As a consequence of approximating the envelope, the quality of the lower-bounding distance deteriorates and the number of false alarms increases. To keep the number of false alarms down to a minimum, we propose to use the query envelope without approximating it. The raw envelope is compatible with the quantized approximation of the data time series, and a lower-bounding distance can be directly computed. To achieve this, we use the Euclidean distance between the raw envelope and the quantized representation of the data time series. Equations 1 and 2 can be easily modified to compute a lower bound on the Euclidean distance when the query object is a region (i.e., the query envelope) instead of a point.

To efficiently process the search, we avoid the use of hierarchical indexes, as they are prone to suffer from the curse of dimensionality. While an ideal hierarchical index would take O(log N) time to find the first nearest neighbor in a database with N objects, organizing a high dimensional space severely affects index performance. In fact, previous research work indicates that for high dimensional data, searching in a hierarchical index reduces to a linear search [24]. That is, a large portion of the index needs to be accessed during search. Because the index pages are fetched randomly from disk, the query performance suffers due to the large IO cost. Instead of using a hierarchical index, we organize the quantized representations of the data time series sequentially in a file. During search, we sequentially scan this file. Reading a file sequentially is more efficient because of pre-fetching. In addition, the computational cost of processing the query can be masked by the IO cost thanks to the use of direct memory access.

A two-step algorithm was presented by Weber et al. [24] for k-NN search on the VA-File (a linear index similar to ours). We use a modified version of this algorithm for exact k-NN search under time warping. We use Weber's algorithm to build a list of candidates ranked by their lower-bound distance to the envelope around the query object. The data of all objects in the candidate list with a lower-bound distance smaller than the actual k-NN distance are fetched from disk. These data are used to compute the DTW distance from the object to the query. If this distance is smaller than the k-NN distance, the corresponding object is included in the answer to the query. If, by inserting a new element into the answer set, its size grows larger than k, one of them is removed from the set. Only the k closest objects are kept in the answer set.

4 Performance Evaluation

In this section, we empirically demonstrate the performance benefits of using quantization organized in a linear index for similarity search under time warping. In our experiments, we used 5 data sets from different sources. Three different performance metrics were used to evaluate the performance of the different techniques presented in this experimental study.

4.1 Data Sets

Our test data sets were obtained from sources in different fields such as medicine, finance, and astrophysics. We did this in an effort to minimize data bias(2) in our experiments. Four data sets were from real-world data sources and one was synthetically generated. From each data source we generated a data set containing 100,000 time series objects of length 256. We randomly extracted 1,000 entries from each data set. This subset of 1,000 time series objects became our query set. We did this to avoid exact matches with queries in our experiments. In addition, this practice allowed us to use query objects that are in the same domain as the data set. Details of our testing data are given next.

(1) Plasma: We obtained this data from the Coordinated Heliospheric Observations Web server (COHOWeb) [7]. COHOWeb provides access to hourly resolution magnetic field and plasma data from 12 different heliospheric spacecraft. In our experiments, we used the plasma temperatures reported by these spacecraft since 1963.

(2) Data bias is the conscious or unconscious use of a particular set of testing data to confirm a desired finding [15].
(2) EEG: This data arose from a large study to examine electroencephalogram correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on subjects' scalps, which were sampled at 256 Hz for 1 second [11].

(3) ECG: This data set was generated from electrocardiogram data obtained from the MIT-BIH database distribution [19].

(4) Mixed Bag: This data set was generated from seven different data sets. Since the source data sets have different properties [17], a preprocessing step was performed to normalize the data. The resulting data set has a mean of zero and a standard deviation of one.

(5) FinTime: Using the financial time series benchmark [12], we synthetically generated stock values for 100,000 companies. While the synthetic data generator provides stock values at opening and closing times, we only used the closing time values in our experiments.

4.2 Performance Metrics

In our experiments, we evaluated the efficiency of different techniques using three metrics. We measured the index search overhead and the number of data objects fetched as the main factors affecting the overall performance of similarity search. We also measured elapsed time as the performance metric directly perceived by the user.

(1) Index search overhead: the number of disk pages that must be read from the index when processing a similarity search query.

(2) Data objects fetched: It has been suggested that we should prevent performance bias due to disparities in the quality of implementation of the methods being compared [15]. Following this recommendation, we evaluate the effectiveness of different k-NN search methods by the number of data objects fetched from the database. The number of data objects fetched during a k-NN query is a performance metric that is independent of the quality of implementation.

(3) Elapsed time: We used wall-clock time to measure the elapsed time during the evaluation of k-NN queries. This time includes both CPU and IO time. In addition, for each query we recorded the time spent on CPU operations only. This allows us to estimate the time spent in IO operations. Finally, to avoid any caching effects on the index and data files, we flushed the main memory between consecutive queries by loading irrelevant data into the system.

For each metric used in our experiments, we present the average value after executing 100 k-NN queries.

4.3 Evaluated Techniques

We have evaluated DTW queries using quantization and compared our results to those obtained using Keogh's envelope. We also include Zhu and Shasha's improved envelope in our empirical evaluation. In our experiments, we created an envelope around the query time series by limiting the warping path using the Sakoe-Chiba band [22]. When comparing two time series, this band restricts the displacement that a particular value in the series can have along the time dimension. Given two time series Q and S, this band indicates that the value of Q at time i can only be compared to the values of S at times i - d to i + d, where d is the time displacement allowed by the band. In our experiments, we set the Sakoe-Chiba band to allow the time series to stretch by a fixed fraction of their length along the time line.

(1) Keogh's Envelope: We implemented Keogh's envelope as indicated in [13]. We applied the Piecewise Aggregate Approximation (PAA) to every entry in the data set and inserted the corresponding data approximation into a disk-resident R-tree.

(2) Zhu's Improved Envelope: We implemented the container-invariant envelope as described by Zhu and Shasha in [28]. We applied PAA to every envelope and inserted the corresponding entry into a disk-resident R-tree.

(3) Quantization: We generated an envelope for the query sequence and used the SCoBE approximation on all the data time series to obtain a compact and accurate representation. While any quantization such as VA can be used, previous research work proved that this quantization works better for similarity search on time series [18]. The data approximations are laid out in a linear index and processed sequentially during search.

All indexes were built using pages of 8 KBytes. We used 32-bit words for storing the values of time series data. The experiments were performed on Intel Pentium computers with 600 MHz processors and the Linux operating system. Each computer has 128 MBytes of main memory and 9 GBytes of disk storage connected through a SCSI interface.

For Keogh's and Zhu's envelopes, we built indexes using feature vectors of 8, 16, and 32 dimensions (i.e., the values of segments in the PAA transformation of the envelope). There is a performance trade-off depending on the size of the feature vector. On one hand, a higher-dimensional vector provides a more accurate representation, reducing the data access overhead. At the same time, it degrades the performance of the hierarchical index by increasing the number of index pages that must be accessed (i.e., the curse of dimensionality). The overall search performance depends largely on these two factors. By varying the size of the feature vector, we were able to identify the best case for each data set. The results shown in this section represent the best observed performance for Keogh's envelope and Zhu's improved envelope.

The quantizing approximations do not reduce dimensionality explicitly. Instead, the size of the data representation is reduced by reducing the resolution of each value (i.e., quantizing it). In our experiments, we used approximations one eighth of the size of the data objects. Note that this achieves the same compression ratio as using 32 segments in the PAA representation.

4.4 Experimental Results

In our experiments, we extensively evaluated the performance of the techniques described in Section 4.3. We measured the performance of these techniques on different data sets. The number of objects in each data set remained fixed at 99,000 entries.
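The quantization-based search evaluated here (raw query envelope, region-based lower bound, and Weber-style two-step refinement, Sections 3.1 and 3.2) can be sketched as follows. This is a simplified in-memory sketch, not the disk-based implementation used in the experiments; the helper names, the toy per-value quantization grid, and all parameter values are illustrative.

```python
import numpy as np

def sakoe_chiba_envelope(q, delta):
    """Raw (unapproximated) envelope around query q: at position i, the
    running min/max of q over the band [i - delta, i + delta]."""
    n = len(q)
    lo = np.array([q[max(0, i - delta): i + delta + 1].min() for i in range(n)])
    hi = np.array([q[max(0, i - delta): i + delta + 1].max() for i in range(n)])
    return lo, hi

def lb_envelope_region(env_lo, env_hi, reg_lo, reg_hi):
    """Squared lower-bound distance between the query envelope and the
    bounding region of a quantized series: per position, the gap between
    the two intervals (zero when they overlap)."""
    gap = np.maximum(np.maximum(env_lo - reg_hi, reg_lo - env_hi), 0.0)
    return float(np.sum(gap ** 2))

def constrained_dtw(q, s, delta):
    """Exact DTW (sum of squared differences) under a Sakoe-Chiba band."""
    n, m = len(q), len(s)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - delta), min(m, i + delta) + 1):
            D[i, j] = (q[i - 1] - s[j - 1]) ** 2 + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def knn_search(query, regions, data, k, delta):
    """Two-step k-NN in the style of Weber et al. [24]: rank candidates by
    their lower bound from a scan of the quantized representations, then
    fetch raw series in that order and refine with the exact DTW."""
    env_lo, env_hi = sakoe_chiba_envelope(query, delta)
    ranked = sorted((lb_envelope_region(env_lo, env_hi, lo, hi), i)
                    for i, (lo, hi) in enumerate(regions))
    best = []  # (distance, index) pairs, kept sorted, at most k entries
    for lb, i in ranked:
        if len(best) == k and lb >= best[-1][0]:
            break  # no remaining candidate can beat the current k-NN distance
        best = sorted(best + [(constrained_dtw(query, data[i], delta), i)])[:k]
    return best
```

Because every true value lies inside its bounding region, the lower bound never exceeds the true constrained DTW distance, so the early-termination test above dismisses no qualifying series and the answer matches a brute-force DTW scan.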
Index Search Overhead (pages)
Data set     Envelope   Improved E.   Quantized
Plasma          4,185         4,073       3,094
EEG             4,637         4,524       3,094
ECG               414           280       3,094
Mixed Bag       1,329           648       3,094
FinTime           639           244       3,094
(a) Number of Disk Pages Fetched from the Index

Data Objects Fetched
Data set     Envelope   Improved E.   Quantized
Plasma         29,928        25,338      22,119
EEG            23,965        19,000      16,369
ECG             1,063           468         188
Mixed Bag       4,358         4,080       3,580
FinTime           275           209         119
(b) Number of Data Objects Fetched

Table 1. IO Performance for 10-NN Queries under Time Warping. Results are Averages over 100 Queries. Index Pages for the Quantized Approach are Read Sequentially.

Elapsed Time (seconds)
Data set     Envelope   Improved E.   Quantized
Plasma        124.323       107.423      56.492
EEG           152.377       123.649      66.638
ECG             6.404         3.617       2.422
Mixed Bag      21.919        17.714      11.036
FinTime         6.126         3.341       2.147

Table 2. Elapsed Time for 10-NN Dynamic Time Warping Queries. Results are Averages over 100 Queries.

4.4.1 Index Search Overhead

In Table 1(a), we present our observations on the index search overhead for all the techniques used in this study. This table shows the number of IO operations on the index while processing the query. We observe that using a hierarchical index can have a negative effect on search performance. In particular, a large number of index pages are read by the envelope and improved envelope approaches for the Plasma and EEG data sets. In addition, we should remember that a hierarchical index is accessed using random IO. Therefore, if a large portion of the hierarchical index is accessed during similarity search, the performance benefits of having an index structure are nullified due to the high cost of random IO operations. On the other hand, a linear index (e.g., the index used with quantization in our approach) is accessed sequentially, and this access pattern takes advantage of fast sequential IO operations. Therefore, despite the fact that more index pages are read by our approach for the ECG, Mixed Bag, and FinTime data sets, the overall query performance is not significantly affected and our proposal is still faster than the other methods.

4.4.2 Data Objects Fetched

The quality of an approximation can be measured by the number of data objects fetched during search. Tight bounding distances provided by the approximation will have a positive effect on reducing the data search space. In Table 1(b), we present the number of objects that each technique fetched during a k-NN query. This table shows the results on all data sets. Note that the number of data pages accessed during search can be estimated from the number of objects fetched. One 8-KByte page stores four data time series of length 256. Therefore, the number of data pages read is given by the following expression.

    pages read = ceil( objects fetched / 4 )

From Table 1(b), we can observe that our proposed approach consistently fetched the smallest number of time series objects on all data sets.

4.4.3 Elapsed Time

To this point, we have presented performance metrics aimed at understanding the behavior of the different similarity search techniques used in this experimental study. In this section, we conclude our experimental evaluation by presenting the performance as perceived by the user, showing the elapsed time measured by wall clock. We loaded irrelevant data into the system to flush the entire memory between consecutive queries and avoid caching effects. Our results are summarized in Table 2.

Table 2 shows that quantization consistently outperformed the state-of-the-art techniques for DTW queries. This was the result of combining efficient accesses to the index (which is read sequentially) with an approximation that required fetching a smaller number of data objects than the other methods.

5 Conclusions

In this paper, we have presented an approach for efficiently processing similarity search on time series data under time warping. Our approach is based on quantization to generate a small and accurate approximation of time series data. We have shown that these approximations, organized in a linear index, outperform sophisticated transformations and hierarchical indexes during exact k-NN DTW search. We have provided experimental evidence showing that our proposed technique provides consistently good performance under a variety of settings.

The effectiveness of our approach for similarity search under time warping comes from minimizing the index search overhead as well as from reducing the number of data objects fetched. By organizing the quantized approximations in a linear index, we can take advantage of fast sequential disk accesses. In consequence, we are able to reduce the time spent during index search. In addition, quantization provides tight upper- and lower-bound distances to the data objects. This allows us to filter out a large number of irrelevant entries and to drastically reduce the data search space.

References

[1] Rakesh Agrawal, Christos Faloutsos, and Arun N. Swami. Efficient Similarity Search in Sequence Databases. In Proceedings of the FODO Conference, pages 69–84, Evanston, IL, October 1993.
[2] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the ACM-SIGMOD Conference, pages 322–331, Atlantic City, NJ, May 1990.

[3] Donald J. Berndt and James Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, pages 359–370, Seattle, WA, July 1994.

[4] Kin-Pong Chan and Ada Wai-Chee Fu. Efficient Time Series Matching by Wavelets. In Proceedings of the International Conference on Data Engineering, pages 126–133, Sydney, Australia, March 1999. IEEE Computer Society.

[5] Kin-Pong Chan, Ada Wai-Chee Fu, and Clement Yu. Haar Wavelets for Efficient Similarity Search of Time-Series: With and Without Time Warping. IEEE Transactions on Knowledge and Data Engineering, 15(3):686–705, May–June 2003.

[6] Selina Chu, Eamonn Keogh, David Hart, and Michael Pazzani. Iterative Deepening Dynamic Time Warping for Time Series. In Proceedings of the 2nd SIAM International Conference on Data Mining, Arlington, VA, April 2002.

[7] COHOWeb. Deep Space Hourly Merged Magnetic Field, Plasma, and Ephemerides Data. http://nssdc.gsfc.nasa.gov/cohoweb/cw.html, October 2002.

[8] Christos Faloutsos. Searching Multimedia Databases By Content. Kluwer Academic Publishers, Boston, MA, 1996.

[9] Christos Faloutsos and King-Ip Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proceedings of the ACM-SIGMOD Conference, pages 163–174, San Jose, CA, May 1995.

[10] Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the ACM-SIGMOD Conference, pages 47–57, Boston, MA, June 1984.

[11] S. Hettich and S. D. Bay. The UCI KDD Archive. http://kdd.ics.uci.edu, 2002.

[12] Kaippallimalil J. Jacob and Dennis Shasha. FinTime – A Financial Time Series Benchmark. http://cs.nyu.edu/cs/faculty/shasha/fintime.html, March 2000.

[13] Eamonn Keogh. Exact Indexing of Dynamic Time Warping. In Proceedings of the VLDB Conference, pages 406–417, Hong Kong, China, August 2002.

[14] Eamonn Keogh, Kaushik Chakrabarti, Michael Pazzani, and Sharad Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems, 3(3):263–286, 2000.

[15] Eamonn Keogh and Shruti Kasetty. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In Proceedings of the ACM-SIGKDD Conference, pages 102–111, Edmonton, Alberta, Canada, July 2002.

[16] Eamonn Keogh and Michael Pazzani. Scaling up Dynamic Time Warping for Datamining Applications. In Proceedings of the ACM-SIGKDD Conference, pages 285–289, Boston, MA, August 2000.

[17] Eamonn Keogh, Kaushik Chakrabarti, Sharad Mehrotra, and Michael Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In Proceedings of the ACM-SIGMOD Conference, pages 151–162, Santa Barbara, CA, May 2001.

[18] Inés F. Vega-López and Bongki Moon. A Quantization Approach for Efficient Similarity Search on Time Series Data. In Proceedings of the International Conference on Internet Information Retrieval, pages 182–189, Goyang, Korea, November 2004.

[19] George B. Moody. MIT-BIH Database Distribution. http://ecg.mit.edu/index.html, 1999.

[20] Sanghyun Park, Wesley W. Chu, Jeehee Yoon, and Chihcheng Hsu. Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. In Proceedings of the International Conference on Data Engineering, pages 23–32, San Diego, CA, February–March 2000. IEEE Computer Society.

[21] Ivan Popivanov. Efficient Similarity Queries over Time Series Data Using Wavelets. Master's thesis, University of Toronto, Toronto, Canada, 2001.

[22] Hiroaki Sakoe and Seibi Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, February 1978.

[23] Yasushi Sakurai, Masatoshi Yoshikawa, Shunsuke Uemura, and Haruhiko Kojima. The A-Tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. In Proceedings of the VLDB Conference, pages 516–526, Cairo, Egypt, September 2000.

[24] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In Proceedings of the VLDB Conference, pages 194–205, New York, NY, August 1998.

[25] Yi-Leh Wu, Divyakant Agrawal, and Amr El Abbadi. A Comparison of DFT and DWT Based Similarity Search in Time-Series Databases. In Proceedings of the ACM-CIKM Conference, pages 488–495, McLean, VA, November 2000.

[26] Byoung-Kee Yi and Christos Faloutsos. Fast Time Sequence Indexing for Arbitrary Lp Norms. In Proceedings of the VLDB Conference, pages 385–394, Cairo, Egypt, September 2000.

[27] Byoung-Kee Yi, H. V. Jagadish, and Christos Faloutsos. Efficient Retrieval of Similar Time Sequences Under Time Warping. In Proceedings of the International Conference on Data Engineering, pages 201–208, Orlando, FL, February 1998. IEEE Computer Society.

[28] Yunyue Zhu and Dennis Shasha. Warping Indexes With Envelope Transforms for Query by Humming. In Proceedings of the ACM-SIGMOD Conference, pages 181–192, San Diego, CA, June 2003.