Shape-based retrieval in time-series databases

2006, Journal of Systems and Software

https://doi.org/10.1016/J.JSS.2005.05.004

Abstract

Shape-based retrieval is defined as the operation that searches for the (sub)sequences whose shapes are similar to that of a query sequence regardless of their actual element values. In this paper, we propose a similarity model suitable for shape-based retrieval and present an indexing method for supporting the similarity model. The proposed similarity model makes it possible to retrieve similar shapes accurately by providing combinations of multiple shape-preserving transformations such as normalization, moving average, and time warping. Our indexing method concisely stores every distinct subsequence in a disk-based suffix tree for efficient and adaptive query processing. We allow the user to dynamically choose a similarity model suitable for a given application; more specifically, the user may determine the parameter p of the distance function Lp when submitting a query. The results of extensive experiments reveal that our approach not only successfully finds the subsequences whose shapes are similar to a query shape but also significantly outperforms the sequential scan method.

The Journal of Systems and Software 79 (2006) 191–203
www.elsevier.com/locate/jss

Sang-Wook Kim a,*, Jeehee Yoon b, Sanghyun Park c, Jung-Im Won c

a College of Information and Communications, Hanyang University, 17 Haengdang, Seongdong, Seoul 133-791, Republic of Korea
b Division of Information Engineering and Telecommunications, Hallym University, 39 Hallymdaehak-gil, Chuncheon, Kangwon 200-702, Republic of Korea
c Department of Computer Science, Yonsei University, 134 Sinchon, Seodaemoon-Gu, Seoul 120-749, Republic of Korea

Received 24 May 2004; received in revised form 5 May 2005; accepted 7 May 2005
Available online 22 June 2005

© 2005 Elsevier Inc. All rights reserved.

Keywords: Similarity search; Shape-based retrieval; Time-series databases

1. Introduction
Similarity search is of growing importance in many new applications such as data mining and data warehousing (Chen et al., 1996; Rafiei and Mendelzon, 1997). A time-series database is a set of data sequences (hereafter, we simply call them sequences), each of which is an ordered list of elements (Agrawal et al., 1993). Sequences of stock prices, money exchange rates, temperature data, product sales data, and company growth rates are typical examples of time-series databases (Agrawal et al., 1995a; Faloutsos et al., 1994). Similarity search is an operation that finds the sequences or subsequences whose changing patterns are similar to that of a given query sequence (Agrawal et al., 1993, 1995a; Faloutsos et al., 1994).

In order to measure the similarity of any two sequences of length n, most approaches (Agrawal et al., 1993; Chu and Wong, 1999; Faloutsos et al., 1994; Goldin and Kanellakis, 1995; Rafiei and Mendelzon, 1997; Rafiei, 1999) map the sequences into points in n-dimensional space and compute the Euclidean distance between those points as a similarity measure. However, they often miss data sequences that are actually similar to a query sequence from the user's perspective. Therefore, recent work on similarity search tends to support various types of transformations such as scaling (Agrawal et al., 1995a; Chu and Wong, 1999), shifting (Agrawal et al., 1995a; Chu and Wong, 1999), normalization (Agrawal et al., 1995a; Chu and Wong, 1999; Das et al., 1997; Goldin and Kanellakis, 1995; Loh et al., 2001), moving average (Loh et al., 2000; Rafiei and Mendelzon, 1997; Rafiei, 1999), and time warping (Berndt and Clifford, 1996; Kim et al., 2001; Park et al., 2000, 2001; Yi et al., 1998).

* Corresponding author. Tel.: +82 2 2220 1736; fax: +82 2 2220 1886. E-mail addresses: [email protected] (S.-W. Kim), [email protected] (J. Yoon), [email protected] (S. Park), [email protected] (J.-I. Won).
This paper addresses the problem of shape-based retrieval, which finds the sequences whose shapes are similar to that of a given query sequence regardless of their actual element values. To provide a flexible solution to this problem, this paper introduces a new similarity model that employs combinations of multiple transformations such as shifting, scaling, moving average, and time warping.

In particular, our similarity model supports multiple Lp distance functions in order to measure the similarity between the two finally transformed sequences; if a user chooses one among the Manhattan distance L1, the Euclidean distance L2, and the maximum distance L∞, the proposed method performs shape-based retrieval using the chosen distance function. The flexibility of choosing among multiple distance functions is quite useful, since users may have different opinions about similarity depending on the application. In addition, an important feature of the proposed method is that it performs shape-based retrieval supporting all three distance functions using only one index built in advance.

Similarity search is classified into whole matching and subsequence matching (Agrawal et al., 1993).

• Whole matching: Given N data sequences S1, ..., SN, a query sequence Q, and a tolerance e, we find the data sequences Si that are similar to Q. Here, we note that the data and query sequences should be of the same length.
• Subsequence matching: Given N data sequences S1, ..., SN of varying lengths, a query sequence Q, and a tolerance e, we find all the sequences Si, one or more subsequences of which are similar to Q, together with the offsets in Si of those subsequences. Here, the data and query sequences are allowed to be of arbitrary lengths.

Since subsequence matching is a generalization of whole matching, it is more widely applicable to practical applications than whole matching.

In this paper, we propose a novel method for processing shape-based subsequence retrieval. We first define an effective similarity model for shape-based subsequence retrieval, and then present the indexing and query processing methods for performing shape-based retrieval under this model. To verify the superiority of the approach, we perform extensive experiments using a variety of data sets. The results reveal that our approach successfully finds all the subsequences that have shapes similar to that of the query sequence, and also achieves high search performance.

This paper is organized as follows. Section 2 briefly reviews previous work related to similarity search. Section 3 defines the notation and terminology used in this paper and introduces our similarity model. Section 4 presents the indexing method for supporting the proposed similarity model, and Section 5 describes our query processing method. Section 6 presents the experimental results to show the superiority of our method, and finally, Section 7 summarizes and concludes the paper.

2. Related work

In this section, we briefly survey previous research results associated with similarity search in time-series databases.

Agrawal et al. (1993) proposed a method for whole matching in time-series databases. First, each data sequence of length n is transformed into a point in f (f ≪ n) dimensional space by using the discrete Fourier transform (DFT). For indexing a large number of such points, an R*-tree (Beckmann et al., 1990) is used. For whole matching, a query sequence is also transformed into a point in f-dimensional space in the same way, and the R*-tree is traversed to perform a range query using the transformed query point. As a result, candidate sequences, which are highly likely to be the final answers, are found. Then, every candidate sequence is accessed from disk, and its actual distance to the query sequence is computed. If the distance is smaller than e, the candidate sequence is returned as a final answer.

Faloutsos et al. (1994) and Moon et al. (2001) proposed methods for processing subsequence matching. They use the concept of a window, which is a fixed-size subsequence inside the query and data sequences. Each window of length w is transformed into a window point in f (f ≪ w) dimensional space by using the DFT or the wavelet transform. Such window points are indexed by an R*-tree (Beckmann et al., 1990). For subsequence matching, windows are extracted from a query sequence and are transformed into window points in f-dimensional space. For each window point, a range query is performed on the R*-tree to obtain candidate subsequences, each of which has a high possibility of being included in the final result. Finally, each candidate subsequence is accessed from disk, and its actual Euclidean distance to the query sequence is examined.
In some applications, similarity search based only on the Euclidean distance often fails to find the data sequences that are actually similar to a query sequence from the user's perspective. Therefore, to give flexibility to the definition of similarity, recent work tends to support various types of transformations.

Agrawal et al. (1995a), Chu and Wong (1999), Das et al. (1997), Goldin and Kanellakis (1995), Loh et al. (2000) and Rafiei (1999) proposed methods for similarity search that support normalization. Normalization enables finding the sequences that have a fluctuation pattern similar to that of a query sequence even though they are not close to each other before the normalization. Das et al. (1997) suggested a whole matching method that finds similar sequences by accessing the entire database. Also, Agrawal et al. (1995a) and Goldin and Kanellakis (1995) proposed methods that exploit R*-trees as indexes to improve the search performance significantly. While Agrawal et al. (1995a), Das et al. (1997) and Goldin and Kanellakis (1995) dealt only with whole matching, Chu and Wong (1999) and Loh et al. (2000) extended their ideas to process subsequence matching effectively by using indexes.

Loh et al. (2001), Rafiei and Mendelzon (1997) and Rafiei (1999) dealt with methods for similarity search that support the moving average transformation. The moving average transformation converts a given data sequence into a new sequence consisting of the averages of k consecutive values in the data sequence, where k is called the moving average coefficient. The moving average transformation is very useful for finding the trend of time-series data by reducing the effect of the noise inside. Rafiei and Mendelzon (1997) proposed a whole matching method that employs the convolution definition (Preparata and Shamos, 1985) to support moving average transformations of arbitrary coefficients using only one index. Loh et al. (2000) pointed out the problems in applying the previous methods to subsequence matching, and suggested a method that performs subsequence matching effectively by using the concept of index interpolation.

Berndt and Clifford (1996), Kim et al. (2001), Park et al. (2000, 2001), Rafiei and Mendelzon (1997), and Yi et al. (1998) addressed the issue of supporting time warping within the similarity model. Time warping is a transformation that allows any sequence element to replicate itself as many times as needed, and is useful in situations where sequences are of varying lengths and so their similarity cannot be directly computed by applying the Lp distance function. Berndt and Clifford (1996) and Yi et al. (1998) proposed whole matching methods, which access and examine every sequence in the database. Kim et al. (2001), Park et al. (2000), and Yi et al. (1998) discussed approaches that exploit indexes to perform whole matching efficiently. Also, Park et al. (2001) extended the basic idea of Kim et al. (2001) to process subsequence matching effectively by using the concept of prefix-querying.

Agrawal et al. (1995b) allowed a user to specify a required query pattern using their shape definition language (SDL). Given a query pattern defined by SDL, they utilized a hierarchical index structure to retrieve the subsequences satisfying the query pattern. This method converts numeric elements into their corresponding symbols and compares the symbol sequences without considering their original element values. Two numeric elements whose values are very close to each other could be converted into different symbols in this method. Therefore, this method may judge quite similar sequences to be dissimilar.¹

Perng et al. (2000) proposed the landmark model for shape-based pattern matching. The landmark model extracts the landmarks from the sequences, and compares any two sequences using their landmarks without examining their original element values. The landmark model enables finding the sequences of similar shapes intuitively. However, the landmarks are highly dependent on target applications, thus making it complicated to find the landmarks suitable for a specific application.

As stated earlier, most previous approaches support only one transformation in their similarity model. In time-series database applications, however, it is frequently required to retrieve the data (sub)sequences whose shapes are similar to that of a query sequence regardless of their actual element values. In this paper, we define this kind of problem as shape-based retrieval.

¹ By the categorization stated in Section 4, our method would also convert similar values into different symbols. Unlike the method in Agrawal et al. (1995b), which only compares two symbolized sequences without considering their values, our method makes value-based comparisons during index traversal by using a lower-bound distance function Dtw-lb. Thus, our method does not make such a wrong judgment due to the symbolization.

3. Problem definition

This section formally defines the problem we are going to solve. Section 3.1 defines the notation and terminology used in this paper, and Section 3.2 describes our similarity model.

3.1. Notation and terminology

Table 1 defines the notation frequently used in this paper. A data sequence is a sequence stored in the database, and a query sequence is a sequence submitted for querying.

Table 1
Notation

Notation        Description
S = (s[i])      Data sequence (0 ≤ i < Len(S))
Len(S)          Number of elements in S
X = (x[i])      Arbitrary subsequence of S (0 ≤ i < Len(X) ≤ Len(S))
Q = (q[i])      Query sequence (0 ≤ i < Len(Q))
e               Distance tolerance
First(S)        The first element of S, s[0]
Rest(S)         Sequence composed of all the elements of S except the first element, (s[1], s[2], ..., s[Len(S) − 1])
Max(S)          The largest element of S
Min(S)          The smallest element of S
()              Empty sequence
Max(a, b)       The larger of a and b
Min(a, b, c)    The smallest of a, b, and c

Definition 1. Given a sequence S = (s[i]) (0 ≤ i < Len(S)), its normalized sequence Norm(S) = (s'[i]) (0 ≤ i < Len(S)) is defined as follows (Agrawal et al., 1995a)²:

    s'[i] = (s[i] − (Max(S) + Min(S))/2) / ((Max(S) − Min(S))/2).
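Definition 1 can be sketched directly; the function name is an assumption of this sketch, and a constant sequence (Max(S) = Min(S)) is left unhandled since the definition divides by zero there.

```python
def normalize(s):
    # Definition 1: map every element into [-1.0, 1.0] using Max(S) and Min(S).
    hi, lo = max(s), min(s)
    mid = (hi + lo) / 2.0    # center of the value range
    half = (hi - lo) / 2.0   # half-width of the value range
    return [(x - mid) / half for x in s]
```

Note that shifting and scaling a sequence leaves its normalized form unchanged, which is exactly why normalization captures "shape" rather than absolute values.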
-2 2 -4 Normalization, a combination of shifting and scaling, 1 1.5 2 2.5 3 3.5 4 4.5 5 is a transformation that reduces the effect of absolute Element number element values. Therefore, it is useful in finding the se- quences with similar changing patterns even though Fig. 1. Example of normalization transformation. their absolute values may be different. For example, con- sider two sequences S1 and S2 in Fig. 1. Although S1 and S2 have different element values, we see that their chang- 30000 ing patterns are quite similar. By normalization, they are S MV4(S) transformed into identical sequences Norm(S1) and MV8(S) 25000 Norm(S2). 20000 Element value Definition 2. Given a sequence S = (s[i]) (0 6 i < Len(S)) and a moving average coefficient k, 15000 S is transformed into MVk(S) = (sk[j]) (0 6 j < 10000 Len(S)  k + 1) by k-moving average transformation (Chatfield, 1984; Kendall, 1979): 5000 Pjþk1 s½j þ s½j þ 1 þ    þ s½j þ k  1 l¼j s½l sk ½j ¼ ¼ . 0 0 5 10 15 20 25 30 35 k k Element number As described in Definition 2, the k-moving average transformation generates a series of elements from the Fig. 2. Example of moving average transformation. average values of successive k elements of an original sequence. The moving average transformation reduces Definition 3. Given two sequences S and Q of the same the effect of noises embedded in sequences. Therefore, length n, the distance function Lp is defined as follows. it is useful to detect sequences of similar changing pat- L1 is the Manhattan distance, L2 is the Euclidean terns without worrying about noises. Users decide the distance, and L1 is the maximum distance in any pair of moving average coefficient k according to the intention elements (Shim et al., 1997). of how much they want to reduce the effect of noises. !1=p For example, Fig. 2 shows sequence S, and its 4- and X n p 8-moving averaged sequences MV4(S) and MV8(S). Lp ðS; QÞ ¼ js½i  q½ij ; 1 6 p 6 1. 
Even though the distance function Lp is widely used in many applications, it has the strict restriction that the two sequences to be compared must be of the same length. Time warping is a transformation that allows any sequence element to replicate itself as many times as needed without extra cost (Yi et al., 1998). For example, the two sequences S = (20, 21, 21, 20, 20, 23, 23, 23) and Q = (20, 20, 21, 20, 23) can be identically transformed into (20, 20, 21, 21, 20, 20, 23, 23, 23) by time warping. The time warping distance is defined as the smallest distance between two sequences transformed by time warping.

² We may use a different type of normalization (Goldin and Kanellakis, 1995; Rafiei and Mendelzon, 1997) that first computes the average avg(S) and the standard deviation std(S) from S, and then replaces each element s[i] with s'[i] = (s[i] − avg(S)) / std(S). However, we employ the normalization defined in Definition 1 in order to make all elements take values within the range [−1.0, 1.0]. This enables high compression of the subsequence tree, which is defined in Section 4. Compared with using avg(S) and std(S), the normalization defined in Definition 1 tends to be sensitive to noise embedded in sequences. As mentioned in Section 3.2, however, our similarity model first eliminates such noise in each sequence through the moving average transformation before performing the normalization transformation. Thus, we can safely employ Min(S) and Max(S) instead of avg(S) and std(S) without worrying about the noise effect.

Definition 4. Given two sequences S and Q, the time warping distance Dtw is defined recursively as follows (Rabiner and Juang, 1993). Here, Dbase can be any Lp function that returns the distance between two elements:

    Dtw((), ()) = 0
    Dtw(S, ()) = Dtw((), Q) = ∞
    Dtw(S, Q) = ((Dbase(First(S), First(Q)))^p
                + (Min(Dtw(S, Rest(Q)), Dtw(Rest(S), Q), Dtw(Rest(S), Rest(Q))))^p)^{1/p}.

While the Euclidean distance can be used only when the two sequences compared are of the same length, the time warping distance can be applied to any two sequences of arbitrary lengths. Therefore, the time warping distance is smoothly applicable to databases where sequences are of different lengths.
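Definition 4 can be sketched as a memoized recursion over suffix positions; the function name, the choice Dbase(a, b) = |a − b|, and the memoization (rather than the cumulative table used later in the paper) are assumptions of this sketch.

```python
from functools import lru_cache

def dtw(s, q, p=2):
    # Definition 4 with Dbase(a, b) = |a - b|; d(i, j) is the time warping
    # distance between the suffixes s[i:] and q[j:].
    s, q = tuple(s), tuple(q)

    @lru_cache(maxsize=None)
    def d(i, j):
        if i == len(s) and j == len(q):
            return 0.0                      # Dtw((), ()) = 0
        if i == len(s) or j == len(q):
            return float('inf')             # Dtw(S, ()) = Dtw((), Q) = inf
        best = min(d(i, j + 1),             # replicate s[i]
                   d(i + 1, j),             # replicate q[j]
                   d(i + 1, j + 1))         # advance both
        return (abs(s[i] - q[j]) ** p + best ** p) ** (1.0 / p)

    return d(0, 0)
```

The example pair from the text warps to an identical sequence, so its time warping distance is zero even though the two sequences have different lengths.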
3.2. Similarity model

The goal of this work is to devise a method that effectively finds the subsequences whose shapes are similar to that of a query sequence. To support shape-based retrieval, this paper defines the following similarity model, which combines the shifting, scaling, moving average, and time warping transformations together.

Definition 5. Given two sequences or subsequences S and Q, their distance (or dissimilarity) is defined as follows:

    D(S, Q) = Dtw(Norm(MVk(S)), Norm(MVk(Q))).

As in Definition 5, the distance of S and Q is defined as the time warping distance of the two sequences converted by (1) the k-moving average transformation and then (2) the normalization transformation.

Shape-based retrieval based on our similarity model supports the shifting, scaling, and time warping transformations, and also minimizes the effect of noise owing to the moving average transformation. In particular, when computing the time warping distance between the two transformed sequences, we provide three types of Lp (the Manhattan distance L1, the Euclidean distance L2, and the maximum distance L∞) as a base distance function. These distance functions have the following features (Agrawal et al., 1993; Sidiropoulos and Bros, 1999; Yi and Faloutsos, 2000). It has been known that L1 is optimal when errors are additive, i.i.d. (independent, identically distributed) Laplacian (or double exponential); thus, it is more robust against impulsive noise. L2 is optimal in the maximum likelihood sense when errors are additive, i.i.d. Gaussian; it has been the most popular dissimilarity measure in time-series applications. L∞ has the advantage that users can easily specify a tolerance without considering the length of a query sequence; it has the disadvantage of being sensitive to noise within a sequence. As stated earlier, however, our similarity model first performs the moving average transformation, thereby removing the noise in each sequence. Therefore, we do not need to worry about the noise effect in applying L∞.

Applications require different target query results according to their characteristics. Also, users even in the same application tend to want different target query results according to their propensities. In this work, we propose a similarity model that supports all of L1, L2, and L∞ as a base distance function in order to give users choices at querying time. Users can choose their own distance function according to their preferences, and thus get what they want to retrieve from the database. In Section 6, we present experimental results for the different Lp distance functions.

We define the problem of shape-based retrieval in a time-series database as follows: given a query sequence Q, a distance tolerance e, a moving average coefficient k, and p for the Lp distance function, we find the subsequences X whose distances to Q, D(X, Q) = Dtw(Norm(MVk(X)), Norm(MVk(Q))), are smaller than e. As a result, we get the sequences S that contain X and the starting offset of X inside S.

4. Indexing

This section describes the indexing method for efficient shape-based retrieval. Section 4.1 briefly reviews the suffix tree, the underlying index structure in our method. Section 4.2 discusses the indexing strategy for utilizing the suffix tree. Section 4.3 describes the index construction steps in detail, and Section 4.4 presents the technique for index compression.

4.1. Suffix tree

A trie is a data structure for indexing a set of keywords of varying sizes. A suffix tree (Stephen, 1994) is a trie whose set of keywords comprises the suffixes of a single sequence; nodes with a single outgoing edge can be collapsed, yielding the structure known as the suffix tree. A suffix tree can be generalized to allow multiple sequences to be stored in the same tree, and is useful for finding the subsequences exactly matching a query sequence. A suffix tree does not assume any distance function for its construction; therefore, it can support various distance functions at query processing time.

Each suffix of a sequence is represented by a leaf node. For example, given a sequence Si = (s[0], s[1], ..., s[Len(Si) − 1]), its suffix (s[j], s[j + 1], ..., s[Len(Si) − 1]) is represented by the leaf node labeled with (Si, j). The edges are labeled with subsequences such that the concatenation of the edge labels on the path from the root to the leaf labeled with (Si, j) becomes the suffix (s[j], s[j + 1], ..., s[Len(Si) − 1]).
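The generalized suffix structure just described can be sketched as a plain (uncollapsed) trie of suffixes; the dictionary-based nodes, the function names, and the use of the string '$' as an end marker (assumed never to be a valid element) are assumptions of this sketch, and a real suffix tree would additionally collapse single-child chains.

```python
def build_suffix_trie(sequences):
    # Insert every suffix of every sequence; each leaf list stores the
    # labels (sid, offset) of the suffixes ending there.
    root = {}
    for sid, seq in enumerate(sequences):
        for j in range(len(seq)):
            node = root
            for sym in seq[j:]:
                node = node.setdefault(sym, {})
            node.setdefault('$', []).append((sid, j))   # end marker '$'
    return root

def find_exact(root, pattern):
    # Walk the trie along the pattern, then collect every leaf label below:
    # these are all suffixes (and hence subsequences) starting with pattern.
    node = root
    for sym in pattern:
        if sym not in node:
            return []
        node = node[sym]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == '$':
                out.extend(child)
            else:
                stack.append(child)
    return sorted(out)
```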
The concatenation of base, we take the following five steps. the edge labels on the path from the root to internal node N represents the longest common prefix of the • Step 1: Moving average transformation suffixes represented by the leaf nodes under N. Fig. 3 After selecting the moving average coefficient k suit- shows the suffix tree constructed from two sequences able for a given target application, we perform k- S1 = (1.1, 2.5, 2.5, 1.1) and S2 = (1.1, 2.5). Four suffixes moving average transformation for every sequence ((1.1, 2.5, 2.5, 1.1), (2.5, 2.5, 1.1), (2.5, 1.1), and (1.1)) stored in a database. MVk(S) denotes the sequence from S1 and two suffixes ((1.1, 2.5) and (2.5)) from S2 converted from S by k-moving average are extracted and then inserted into the suffix tree. The transformation. symbol $ is used as the end marker of a suffix. • Step 2: Subsequence extraction The suffix tree becomes more compact when the suf- We extract every possible subsequence X from fixes contain more and longer common prefixes. How- MVk(S). Note that X is not a subsequence of S but ever, it is rare for the suffixes to have common prefixes a subsequence of MVk(S). At this step, we can deter- because every element takes values from a continuous mine the minimum length L of subsequences to be domain. Park et al. (2000) proposed to use categoriza- extracted. That is, when it is meaningless to retrieve tion to solve this problem. Categorization divides the en- too short subsequences as answers, we extract only tire range from which elements take their values into the subsequences whose lengths are longer than or multiple non-overlapping subranges. Then, every ele- equal to L. This contributes towards reducing the size ment is converted into the symbol of its corresponding of the subsequence tree. subrange. 
If we build the suffix tree from sequences of • Step 3: Normalization transformation symbols, the suffixes are likely to have the common pre- We normalize every subsequence X using Max(X) fixes much more than before. and Min(X). Norm(X) denotes the normalized sequence of X. 4.2. Indexing strategy • Step 4: Symbolization using categorization We first decide a set of categories (subranges) by Consider a suffix Sa of a sequence S and a prefix Sb of examining the distribution of element values in a Sa. Since Sb is the prefix of Sa, every element of Sb is database after step 3, and assign a unique symbol contained in Sa. Therefore, we can obtain the distance to each category. Then, we convert each element of of Sb and Q while we compute the time warping distance Norm(X) into the symbol of the corresponding cate- between Sa and Q. Thus, if we do not consider normal- gory. The number of categories depends on a target ization, the index constructed from the suffixes is enough application (Park et al., 2000). for retrieving any subsequences. • Step 5: Tree construction However, our similarity model supports normaliza- We build a subsequence tree from a set of symbolized tion. As indicated in Definition 1, we use Max(S) and subsequences. Each symbolized subsequence is repre- Min(S) to normalize (sub)sequence S. Let us consider sented by the path from the root to a leaf node. The the suffix Sa and its prefix Sb again. Since Max(Sa) identifier (SID, i, j) of a symbolized subsequence is and Min(Sa) can be different from Max(Sb) and stored in the corresponding leaf node where SID is Min(Sb), we can not guarantee that Norm(Sb) is the pre- the identifier of a sequence from which the symbol- fix of Norm(Sa). Therefore, the distance of Norm(Sb) ized subsequence has been obtained, i is its starting S.-W. Kim et al. / The Journal of Systems and Software 79 (2006) 191–203 197 offset, and j is its ending offset. 
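Steps 1–4 above can be sketched for a single data sequence. The helper names, the category-boundary list, and the decision to skip constant subsequences (where Definition 1 would divide by zero) are assumptions of this sketch, not choices made by the paper.

```python
def symbolize(value, boundaries):
    # Step 4 helper: boundaries split the normalized range into subranges;
    # the returned index plays the role of the category's symbol.
    for i, b in enumerate(boundaries):
        if value <= b:
            return i
    return len(boundaries)

def subsequences_for_tree(s, k, min_len, boundaries):
    # Steps 1-4 of Section 4.3 for one data sequence S (hypothetical helper).
    mv = [sum(s[j:j + k]) / k for j in range(len(s) - k + 1)]        # Step 1
    out = []
    for i in range(len(mv)):                                          # Step 2
        for j in range(i + min_len - 1, len(mv)):
            x = mv[i:j + 1]
            hi, lo = max(x), min(x)
            if hi == lo:
                continue           # constant subsequence: Norm is undefined
            norm = [(v - (hi + lo) / 2) / ((hi - lo) / 2) for v in x]  # Step 3
            symbols = [symbolize(v, boundaries) for v in norm]         # Step 4
            out.append(((i, j), symbols))   # offsets are w.r.t. MVk(S)
    return out
```

Step 5 would then insert each symbol sequence into the tree, storing (SID, i, j) at the leaf.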
Since the subsequence perform the following query processing algorithm using tree is possibly constructed from a large volume of a this query sequence Q. database, we employ the disk-based algorithm (Park Algorithm 1. Similarity search algorithm using compact et al., 2000) that reduces the number of disk accesses. subsequence tree Search-CST. Input: compact subsequence tree CST, query sequence 4.4. Index compression Q, tolerance e, base distance function Lp Output: set of answers answerSet The subsequence tree does not grow excessively even 1 candidateSet :¼ VisitNode-and-FindAnswers-CST when the database becomes large. This is because the (rootNode (CST), Q, e, emptyTable, Lp); subsequence tree has a large potential to compress the 2 answerSet :¼ PostProcess (candidateSet); input subsequences. However, it is still true that the sub- 3 return answerSet; sequence tree from fewer subsequences is smaller than the one from more subsequences. This subsection pre- Algorithm 1 shows Search-CST that traverses the sents the technique, which reduces the number of subse- compact subsequence tree to retrieve the subsequences quences to be stored in the subsequence tree without whose time warping distances Dtw from query sequence losing any necessary information for query processing. Q are within the distance tolerance e. Remember that we Let us consider two subsequences Sa and Sb where Sb allow users to specify the desired Lp distance function at is a prefix of Sa. Norm(Sb) is still prefix of Norm(Sa) querying time. Therefore, Lp is given to Algorithm 1 as when Max(Sa) = Max(Sb) and Min(Sa) = Min(Sb). one of the arguments. We note that the query sequence Therefore, given a query sequence Q, the distance be- and the subsequences in the tree have been transformed tween Norm(Sb) and Q can be obtained as a by-product by k-moving average and normalization. of the distance computation between Norm(Sa) and Q. 
As described in Section 4.3, a compact subsequence This implies that Sb does not have to be inserted into tree is constructed from symbol sequences. Therefore, the subsequence tree as long as Sa is stored in the tree. it is not possible to compute the exact time warping dis- Let S[i:j] denote the subsequence of S including elements tance between the query sequence and the subsequence in positions i through j. The procedure to determine already converted into the symbol sequence. Therefore, whether S[i:j] is to be inserted into the tree or not is for- Search-CST uses the distance function Dtw-lb defined in mally defined as follows: (1) insert S[i:j] if j = Len(S)  1 Park et al. (2000) to compute the lower bound distance (that is, there is no subsequence which contains S[i:j] as between the query sequence and symbolized sequence a prefix), (2) insert S[i:j] if Max(S[i:j]) 5 Max(S[i:j + 1]) CS. or Min(S[i:j]) 5 Min(S[i:j + 1]). We call the tree, which stores the set of those subsequences which pass the Definition 6. Given query sequence Q and symbolized above test procedure, the compact subsequence tree. sequence CS, the lower-bound time warping distance function Dtw-lb(CS, Q) is defined as follows: 5. Query processing Dtw-lb ððÞ; ðÞÞ ¼ 0; Dtw-lb ðCS; ðÞÞ ¼ Dtw-lb ððÞ; QÞ ¼ 1; This section presents the query processing method for  shape-based retrieval of similar subsequences, and Dtw-lb ðCS; QÞ ¼ ðDbase-lb ðFirstðCSÞ; FirstðQÞÞÞ p shows its computation complexity. þ ðMinðDtw-lb ðCS; RestðQÞÞ; 5.1. Algorithm Dtw-lb ðRestðCSÞ; QÞ; We premise that users submit a noise-free query se- 1=p Dtw-lb ðRestðCSÞ; RestðQÞÞÞÞp quence by using either one of two ways. The first way is to make users directly determine the element values Dbase-lb ðA; bÞ ¼ 0 ðif A.lb 6 b 6 A.ubÞ of the query sequence. The query sequence thus ob- tained is free from the noises since users draw the spe- ¼ b  A.ub ðif b > A.ubÞ cific shape of a sequence that they want to find. 
The second way is to make users select a subsequence from ¼ A.lb  b ðif b < A.lbÞ. a sequence chosen in a database, and also transform it into a k-moving averaged subsequence. In both ways, Here, A is the symbol of First(CS) and b is the actual nu- we are free from the noise effect, and thus safely regard meric value of First(Q). A.lb and A.ub denote the mini- both query sequences to be k-moving averaged. As a mum and maximum element values of the subrange next step, we normalize the query sequence, and then corresponding to the symbol A. 198 S.-W. Kim et al. / The Journal of Systems and Software 79 (2006) 191–203 Algorithm Search-CST calls function VisitNode-and- its actual distance from Q using the original time warp- FindAnswers-CST (line 1) to retrieve the candidate sub- ing distance function Dtw. The subsequences whose ac- sequences whose lower bound time warping distances tual time warping distances from Q are not larger than from Q are within e. Let us consider function Visit- e are returned to a user as final answers (line 3). Node-and-FindAnswers-CST shown in Algorithm 2. When the algorithm visits node N, it inspects each child Algorithm 2. Algorithm for traversing the compact node to find new candidates and determines if it is nec- subsequence tree VisitNode-and-FindAnswers-CST. essary to go further down the tree. For example, suppose that the algorithm stays on node N and inspects its child node CNi. For simpler explanation, we assume that each edge is labeled by a single symbol. The algorithm builds the cumulative dis- tance table between Q and label(N, CNi), locating Q and label(N, CNi) on the X-axis and on the Y-axis, respec- tively. If N is the root node, the table is built from the bottom. Otherwise, the table is constructed by augment- ing a new row on the table that has been accumulated from the root to N. Function AddRow (line 3) adds a new row using the distance function Dtw-lb with a given Lp as a base distance function. 
The first step is to inspect the last column of the newly added row to find candi- dates (line 4). If the last column has a value not larger than e (line 5), then the algorithm extracts the identifiers from the leaf nodes under CNi and adds them to the candidate set (line 6). Note that there may be some data subsequences not stored in a compact subsequence tree. These subse- quences are embedded in the paths from the root to internal nodes. Therefore, the paths to internal nodes can generate candidate answers. Suppose that the path 5.2. Algorithm analysis from the root to the node CNi has a lower-bound dis- tance not larger than e from Q. Then, all the subse- Before analyzing the complexity of Search-CST, let us quences represented by this path can be easily examine the complexities of Seq-Scan and Search-ST. identified by extracting the leaf nodes under CNi. Let Seq-Scan is the sequential scan method and Search-ST LN be one of such leaf nodes. When LN has an identi- is the method using a subsequence tree as an index. fier (Sk, j, j 0 ) and the length of the path from the root to Seq-Scan reads each data sequence S, performs k- CNi is l, one of the subsequences identified by this path moving average transformation for S to get MVk(S), ex- is by (Sk, j, j + l  1). tracts every possible subsequence X from MVk(S), and The next step is to determine if it is necessary to normalizes every X. Given query sequence Q and subse- go further down the tree. That is, if at least one col- quence X, it builds the cumulative distance table with umn of the newly added row has a value not greater the computation complexity of O(jQj jXj) (Berndt and than e (line 8), the search continues to go down the tree Clifford, 1996). For M k-moving averaged sequences to find more candidates. Otherwise, the search moves to whose average length is L, there are MLðLþ1Þ subsequences 2 the next child of N. When it is necessary to go further and their average length is Lþ2. 
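These subsequence counts follow from enumerating all contiguous subsequences of a length-L sequence; they can be verified with a quick numeric check (the helper name below is illustrative, not part of the paper's algorithms):

```python
# Check: a sequence of length L has L(L+1)/2 contiguous subsequences,
# whose average length is (L+2)/3 -- the figures used in the Seq-Scan analysis.
def subsequence_stats(L):
    lengths = [j - i + 1 for i in range(L) for j in range(i, L)]
    return len(lengths), sum(lengths) / len(lengths)

count, avg_len = subsequence_stats(100)
assert count == 100 * 101 // 2               # L(L+1)/2 = 5050
assert abs(avg_len - (100 + 2) / 3) < 1e-9   # (L+2)/3 = 34
```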
Therefore, the computation complexity of Seq-Scan is O(ML³|Q|).

Search-ST is computationally much cheaper than Seq-Scan due to branch-pruning and the sharing of common cumulative distance tables among all subsequences that have common prefixes. Thus, the complexity of Search-ST is O(ML³|Q| / (Rd · Rp) + nL|Q|). The left term is for tree traversal and the right term is for post-processing. Here, Rd (≥ 1) is the reduction factor due to sharing of the cumulative distance tables, Rp (≥ 1) is the reduction factor gained from branch-pruning, and n is the number of candidate subsequences that require post-processing.

The complexity of Search-CST is easily derived from that of Search-ST. The compact subsequence tree can be traversed faster than the corresponding subsequence tree because the former is usually smaller than the latter. However, the subsequence tree has a larger reduction factor Rd due to a higher possibility of sharing common prefixes. Let q (0 < q ≤ 1) be the compression ratio of the compact subsequence tree, i.e., q = (the number of stored subsequences) / (the number of total subsequences). Then the complexity of Search-CST is O(qML³|Q| / (R′d · Rp) + nL|Q|).

6. Performance evaluation

This section presents the experimental results for the performance evaluation of the proposed method. Section 6.1 describes the environment for the experiments, and Section 6.2 shows and analyzes the experimental results.

6.1. Environment

All the parameter values for our experiments were selected to simulate the behavior of real-world stock prices and query patterns issued for investment. Two kinds of data sets were used for the experiments: a synthetic data set and a real-world stock data set. Each synthetic data sequence S = ⟨s1, s2, ..., sn⟩ was generated by the following random walk expression:

si = si−1 + zi.

Here, zi is an independent, identically distributed (IID) random variable that takes values in the range [−0.1, 0.1]. The value of the first element s1 was taken randomly within the range [1, 10]. Stock data sequences were extracted from the USA S&P 500, and their element values were based on daily closing prices. As mentioned in Section 5, query sequences were randomly selected from those obtained by k-moving averaging such data sequences.

We have conducted performance evaluation on three different approaches: Search-CST, Search-ST, and Seq-Scan. Search-CST represents our approach that uses the compact subsequence tree and utilizes a single index structure for all Lp (p = 1, 2, ∞) distance functions. Search-CST-L1, Search-CST-L2, and Search-CST-L∞ represent our approaches that employ the L1, L2, and L∞ distance functions, respectively. Search-ST represents our approach that uses just the subsequence tree. Search-ST-L1, Search-ST-L2, and Search-ST-L∞ represent Search-ST methods that employ L1, L2, and L∞, respectively. Finally, Seq-Scan is a naive sequential scan method, and Seq-Scan-L1, Seq-Scan-L2, and Seq-Scan-L∞ represent Seq-Scan methods that employ L1, L2, and L∞, respectively.

The hardware platform for the experiments is a Sun UltraSparc-10 workstation equipped with 512 MB RAM. The software platform is Solaris 7, which was set to a single-user mode to minimize the interference from other system and user processes.

6.2. Results and analyses

In Experiment 1, we evaluated the effectiveness and performance of our similarity model with the S&P 500 stock data set. We selected 1000 sequences of average length 100 from the data set and applied the 10-moving average transformation. As a query sequence, we used a subsequence of the double bottom pattern, which is frequently used in stock data analyses. The values of the distance tolerance e were determined to retrieve 100, 200, 300, and 1000 different answers, respectively.

Fig. 4 shows some instances of the double bottom pattern retrieved by our similarity method. In each figure, the dotted line represents an actual sequence in the data set, and the plain line represents its 10-moving averaged sequence. Fig. 4(a) represents the sequence selected for querying; we extracted a double bottom pattern from that sequence and used it as a query sequence. The remaining figures represent the answers obtained from the data set. The subsequences represented by bold lines in Fig. 4(b)–(d) are some answers retrieved by our methods that employ L1, L2, and L∞, respectively.

Fig. 4. Example of shape-based retrieval of similar subsequences: (a) query sequence, (b) L1-based, (c) L2-based, (d) L∞-based, and (e) L1, L2, L∞-based.

We see that all the answers contain the double bottom pattern whose shape is quite similar to that of the query sequence even though their actual element values are multifarious. These answers also illustrate the characteristics of the different distance functions well. We believe that the choice of an appropriate distance function for similar subsequence matching is highly application dependent and up to application engineers.

As the proposed similarity model supports the moving average and time warping transformations, the answers retrieved by our methods using the L1, L2, and L∞ distance functions have many common subsequences. For example, Fig. 4(e) depicts common answers retrieved by our three methods employing the L1, L2, and L∞ distance functions. Fig. 5 shows the results of an experiment that compares the answers retrieved by our methods. The stock data sequences were used as a data set, and the distance tolerance e was determined to retrieve 100 different answers. The Venn diagram represents the percentages of common answers retrieved by two or three methods. The experimental results with more answers were similar to those in Fig. 5.

Fig. 5. Percentages of common answers retrieved by our methods using the L1, L2, and L∞ distance functions.

In Experiment 2, we compared the time and space efficiency of the proposed approaches. We selected 200, 400, 600, 800, and 1000 sequences of average length 100 from the stock data set. To construct the index, we set the number of categories (C) to 60. As C increases, the index size also increases but the query processing time decreases. However, when C exceeds a certain threshold value, the index size and the query processing time become nearly constant. We take this threshold value as the optimal number of categories. In our experiment, we chose 60 as the optimal number of categories, and thus set the number of categories for all the following experiments to 60.

Table 2 shows the sizes of the proposed indices. We observe that the sizes of Search-ST and Search-CST increase linearly as the number of sequences increases. In comparison with Search-ST, the numbers of internal nodes, edges, and leaf nodes of Search-CST are reduced by approximately 50%. However, because the edges of the compact subsequence tree are longer than those of the corresponding subsequence tree, Search-CST actually saves about 36% of storage space.

Table 2
The size of the proposed indices with the increasing number of sequences

Number of   Search      Number of   Number of        Number of   Index size
sequences   algorithm   edges       internal nodes   leaf nodes  (KBytes)
200         Search-ST     997,806     997,805          819,000     44,960
            Search-CST    537,176     537,175          350,300     28,483
400         Search-ST   1,992,098   1,992,097        1,638,000     90,534
            Search-CST  1,098,474   1,098,473          723,721     58,373
600         Search-ST   3,000,515   3,000,514        2,457,000    136,101
            Search-CST  1,649,482   1,649,481        1,081,029     87,433
800         Search-ST   3,998,892   3,998,891        3,276,000    181,170
            Search-CST  2,191,398   2,191,397        1,435,043    115,903
1000        Search-ST   5,006,675   5,006,674        4,095,000    227,021
            Search-CST  2,727,899   2,727,898        1,782,489    144,326

Fig. 6 shows the average query processing time of the three methods with various values of e. We selected 200 sequences of average length 100 from the stock data set. The average length of query sequences was set to 20. The values of e were determined to retrieve 10, 30, 100, and 300 answers, respectively.

Fig. 6. Average query processing time with the increasing number of answers.

As expected, the average query processing times of Search-CST and Search-ST are almost the same for any value of p in the Lp-based distance function, but Search-CST performs much better than Seq-Scan. For example, when the L1-based time-warping distance is used, Search-CST-L1 performs better than Seq-Scan-L1 by about 50–117 times over the different values of e. Search-CST-L2 performs better than Seq-Scan-L2 by about 26–66 times, and Search-CST-L∞ performs better than Seq-Scan-L∞ by about 13–23 times. As the values of e in L1 and L2 must be larger than those of L∞ to retrieve the same number of answers, the query processing times of Search-CST-L1 and Search-CST-L2 get longer than that of Search-CST-L∞.

From the results of Experiment 2, we conclude that Search-CST is better than Search-ST because it solves the problem of index size while preserving good performance. Therefore, in the following experiments, we compare only Search-CST with the sequential scan method.
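Experiment 4 below uses the synthetic data set; the random-walk generator of Section 6.1 can be sketched as follows (the function name and parameter defaults are illustrative, with s1 drawn uniformly from [1, 10] and IID steps from [−0.1, 0.1] as described there):

```python
import random

def random_walk(n, first_lo=1.0, first_hi=10.0, step=0.1):
    """Synthetic sequence sketch per Section 6.1: s1 is uniform on
    [first_lo, first_hi], and each s_i = s_{i-1} + z_i with the z_i
    IID uniform on [-step, step]."""
    s = [random.uniform(first_lo, first_hi)]
    for _ in range(1, n):
        s.append(s[-1] + random.uniform(-step, step))
    return s
```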
In Experiment 3, we compared the average query processing times of the two approaches while changing the average length of the query sequences. We selected 200 sequences of average length 100 from the stock data set. The values of e were determined to retrieve about 100 answers. Fig. 7 shows the average query processing times of Search-CST and Seq-Scan with increasing length of query sequences. The result shows that Search-CST performs better than Seq-Scan regardless of the query sequence length. As seen from this figure, the performance improvements of Search-CST are similar to those in Fig. 6 for any value of p in the Lp-based distance function.

Fig. 7. Average query processing time with the increasing average length of query sequences.

In Experiment 4, we compared the two approaches with an increasing number of sequences and an increasing average length of sequences, respectively. We used a large-volume synthetic data set for this experiment. The average length of query sequences was set to 20.

First, we fixed the length of sequences at 100 and increased the number of sequences from 2000 to 10,000. Fig. 8 shows the query processing times of the two approaches with various numbers of sequences. The value of e was chosen to retrieve about 100 answers from 2000 data sequences. The elapsed times of both approaches increase linearly as the number of sequences grows, but Search-CST shows better performance than Seq-Scan regardless of the number of sequences. As seen from this figure, the performance improvements of Search-CST are similar to those in Fig. 6 for any value of p in the Lp-based distance function.

Fig. 8. Average query processing time with the increasing number of data sequences.

Then, we fixed the number of sequences at 100 and increased the length of sequences from 200 to 1000. Fig. 9 shows the query processing times of the two approaches with changing average length of sequences. The value of e was chosen to retrieve about 100 answers from data sequences of length 200. As shown in Fig. 9, while the elapsed time of Seq-Scan increases rapidly, that of Search-CST increases quite slowly. For example, when the L1-based time-warping distance function is used, Search-CST-L1 performs better than Seq-Scan-L1 by about 102–362 times. Search-CST-L2 performs better than Seq-Scan-L2 by about 61–390 times, and Search-CST-L∞ performs better than Seq-Scan-L∞ by about 31–253 times. The performance gain gets larger as the length of the sequences increases.

Fig. 9. Average query processing time with the increasing length of data sequences.

7. Conclusions

This paper discussed the problem of shape-based retrieval in time-series databases. It defined a new similarity model for shape-based subsequence retrieval, and also proposed indexing and query processing methods for supporting this similarity model efficiently.

The proposed similarity model supports a combination of transformations such as shifting, scaling, moving average, and time warping, and allows users to choose an Lp distance function for computing the similarity between the two finally-transformed sequences. Thus, it enables users to define the target results of a query depending on their preferences. Also, we proposed a compressed subsequence tree and a query processing method for efficient processing of shape-based retrieval without false dismissals. The compressed subsequence tree is a compact version of a disk-based subsequence tree. An important feature of our approach is that it supports our similarity model based on L1, L2, and L∞ with only one index structure.

To verify the superiority of our approach, we performed a series of experiments with a real-world S&P 500 stock data set and large synthetic data sets. The results reveal that our approach successfully finds all the subsequences whose shapes are similar to that of the query sequence, and also achieves speedups of several tens to several hundreds of times compared with the sequential scan method.

Acknowledgments

This work has been supported by the Korea Research Foundation with Grant KRF-2003-041-D00486, the IT Research Center via Kangwon National University, and the University Research Program (C1-2002-146-0-3) of IITA. Sang-Wook Kim would like to thank Jung-Hee Seo, Suk-Yeon Hwang, Grace (Joo-Young) Kim, and Joo-Sung Kim for their encouragement and support.

References

  1. Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In: Proc. FODO. pp. 69-84.
  2. Agrawal, R., Lin, K., Sawhney, H.S., Shim, K., 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In: Proc. VLDB. pp. 490-501.
  3. Agrawal, R., Psaila, G., Wimmers, E.L., Zaït, M., 1995. Querying shapes of histories. In: Proc. VLDB. pp. 502-514.
  4. Beckmann, N., Kriegel, H., Schneider, R., Seeger, B., 1990. The R*-tree: an efficient and robust access method for points and rectangles. In: Proc. ACM SIGMOD. pp. 322-331.
  5. Berndt, D.J., Clifford, J., 1996. Finding patterns in time series: a dynamic programming approach. In: Advances in Knowledge Discovery and Data Mining. AAAI/MIT, Cambridge, MA, pp. 229-248.
  6. Chatfield, C., 1984. The Analysis of Time-series: an Introduction, third ed. Chapman and Hall, London.
  7. Chen, M.S., Han, J., Yu, P.S., 1996. Data mining: an overview from database perspective. IEEE TKDE 8 (6), 866-883.
  8. Chu, K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: Proc. ACM PODS. pp. 237-248.
  9. Das, G., Gunopulos, D., Mannila, H., 1997. Finding similar time series. In: Proc. PKDD, pp. 88-100.
  10. Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching in time-series databases. In: Proc. ACM SIGMOD, pp. 419-429.
  11. Goldin, D.Q., Kanellakis, P.C., 1995. On similarity queries for time-series data: constraint specification and implementation. In: Proc. Constraint Programming. pp. 137-153.
  13. Kendall, M., 1979. Time-series, second ed. Charles Griffin and Company, London.
  14. Kim, S.W., Park, S., Chu, W.W., 2001. An index-based approach for similarity search supporting time warping in large sequence databases. In: Proc. IEEE ICDE. pp. 607-614.
  15. Loh, W.K., Kim, S.W., Whang, K.Y., 2000. Index interpolation: an approach for subsequence matching supporting normalization transform in time-series databases. In: Proc. ACM CIKM. pp. 480-487.
  16. Loh, W.K., Kim, S.W., Whang, K.Y., 2001. Index interpolation: a subsequence matching algorithm supporting moving average transform of arbitrary order in time-series databases. IEICE Trans. Inf. Syst. E84-D (1), 76-86.
  17. Moon, Y.S., Whang, K.Y., Loh, W.K., 2001. Duality-based subsequence matching in time-series databases. In: Proc. IEEE ICDE. pp. 263-272.
  18. Park, S., Chu, W.W., Yoon, J., Hsu, C., 2000. Efficient searches for similar subsequences of different lengths in sequence databases. In: Proc. IEEE ICDE. pp. 23-32.
  19. Park, S., Kim, S.W., Cho, J.S., Padmanabhan, S., 2001. Prefix-querying: an approach for effective subsequence matching under time warping in sequence databases. In: Proc. ACM CIKM. pp. 255-262.
  20. Perng, C.S., Wang, H., Zhang, S.R., Parker, D.S., 2000. Landmarks: a new model for similarity-based pattern querying in time series databases. In: Proc. IEEE ICDE. pp. 33-42.
  21. Preparata, F.P., Shamos, M., 1985. Computational Geometry: an Introduction. Springer-Verlag, Berlin.
  22. Rabiner, L., Juang, H.H., 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.
  23. Rafiei, D., 1999. On similarity-based queries for time series data. In: Proc. IEEE ICDE. pp. 410-417.
  24. Rafiei, D., Mendelzon, A., 1997. Similarity-based queries for time-series data. In: Proc. ACM SIGMOD. pp. 13-24.
  25. Shim, K., Srikant, R., Agrawal, R., 1997. High-dimensional similarity joins. In: Proc. IEEE ICDE, April. pp. 301-311.
  26. Sidiropoulos, N.D., Bro, R., 1999. Mathematical programming algorithms for regression-based non-linear filtering in R^N. IEEE Trans. Signal Process. (Mar).
  27. Stephen, G.A., 1994. String Searching Algorithms. World Scientific Publishing, Singapore.
  28. Yi, B.K., Faloutsos, C., 2000. Fast time sequence indexing for arbitrary Lp norms. In: Proc. VLDB. pp. 385-394.
  29. Yi, B.-K., Jagadish, H.V., Faloutsos, C., 1998. Efficient retrieval of similar time sequences under time warping. In: Proc. IEEE ICDE. pp. 201-208.