The Journal of Systems and Software 79 (2006) 191–203
www.elsevier.com/locate/jss

Shape-based retrieval in time-series databases

Sang-Wook Kim a,*, Jeehee Yoon b, Sanghyun Park c, Jung-Im Won c

a College of Information and Communications, Hanyang University, 17 Haengdang, Seongdong, Seoul 133-791, Republic of Korea
b Division of Information Engineering and Telecommunications, Hallym University, 39 Hallymdaehak-gil, Chuncheon, Kangwon 200-702, Republic of Korea
c Department of Computer Science, Yonsei University, 134 Sinchon, Seodaemoon-Gu, Seoul 120-749, Republic of Korea

Received 24 May 2004; received in revised form 5 May 2005; accepted 7 May 2005
Available online 22 June 2005
Abstract

Shape-based retrieval is defined as the operation that searches for the (sub)sequences whose shapes are similar to that of a query sequence regardless of their actual element values. In this paper, we propose a similarity model suitable for shape-based retrieval and present an indexing method for supporting the similarity model. The proposed similarity model makes it possible to retrieve similar shapes accurately by providing a combination of multiple shape-preserving transformations such as normalization, moving average, and time warping. Our indexing method stores every distinct subsequence concisely in a disk-based suffix tree for efficient and adaptive query processing. We allow the user to dynamically choose a similarity model suitable for a given application; more specifically, the user may determine the parameter p of the distance function Lp when submitting a query. The results of extensive experiments reveal that our approach not only successfully finds the subsequences whose shapes are similar to a query shape but also significantly outperforms the sequential scan method.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Similarity search; Shape-based retrieval; Time-series databases
1. Introduction

The time-series database is a set of data sequences (hereafter, we simply call them sequences), each of which is an ordered list of elements (Agrawal et al., 1993). Sequences of stock prices, money exchange rates, temperature data, product sales data, and company growth rates are typical examples of time-series databases (Agrawal et al., 1995a; Faloutsos et al., 1994). Similarity search is an operation that finds sequences or subsequences whose changing patterns are similar to that of a given query sequence (Agrawal et al., 1993, 1995a; Faloutsos et al., 1994). Similarity search is of growing importance in many new applications such as data mining and data warehousing (Chen et al., 1996; Rafiei and Mendelzon, 1997).

In order to measure the similarity of any two sequences of length n, most approaches (Agrawal et al., 1993; Chu and Wong, 1999; Faloutsos et al., 1994; Goldin and Kanellakis, 1995; Rafiei and Mendelzon, 1997; Rafiei, 1999) map the sequences into points in n-dimensional space and compute the Euclidean distance between those points as a similarity measure. However, they often miss data sequences that are actually similar to a query sequence from the users' perspective. Therefore, recent work on similarity search tends to support various types of transformations such as scaling (Agrawal et al., 1995a; Chu and Wong, 1999), shifting (Agrawal et al., 1995a; Chu and Wong, 1999), normalization (Agrawal et al., 1995a; Chu and Wong, 1999; Das et al., 1997; Goldin and Kanellakis, 1995; Loh et al., 2001), moving average (Loh et al., 2000; Rafiei and Mendelzon, 1997; Rafiei, 1999), and time warping (Berndt and Clifford, 1996; Kim et al., 2001; Park et al., 2000, 2001; Yi et al., 1998).

* Corresponding author. Tel.: +82 2 2220 1736; fax: +82 2 2220 1886.
E-mail addresses: [email protected] (S.-W. Kim), jhyoon@hallym.ac.kr (J. Yoon), [email protected] (S. Park), jiwon@cs.yonsei.ac.kr (J.-I. Won).

doi:10.1016/j.jss.2005.05.004
This paper addresses the problem of shape-based retrieval, which finds the sequences whose shapes are similar to that of a given query sequence regardless of their actual element values. To provide a flexible solution to this problem, this paper introduces a new similarity model that employs combinations of multiple transformations such as shifting, scaling, moving average, and time warping.

In particular, our similarity model supports multiple Lp distance functions in order to measure the similarity between the two finally transformed sequences; if a user chooses one among the Manhattan distance L1, the Euclidean distance L2, and the maximum distance L∞, the proposed method performs the shape-based retrieval by using the chosen distance function. The flexibility of choosing among multiple distance functions is fairly useful since users could have different opinions about similarity depending on applications. In addition, an important feature of the proposed method is that it supports these three distance functions by using only one index built in advance.

Similarity search is classified into whole matching and subsequence matching (Agrawal et al., 1993).

• Whole matching: Given N data sequences S1, ..., SN, a query sequence Q, and a tolerance ε, we find the data sequences Si that are similar to Q. Here, we note that the data and query sequences should be of the same length.
• Subsequence matching: Given N data sequences S1, ..., SN of varying lengths, a query sequence Q, and the tolerance ε, we find all the sequences Si, one or more subsequences of which are similar to Q, and the offsets in Si of those subsequences. Here, the data and query sequences are allowed to be of arbitrary lengths.

Since subsequence matching is a generalization of whole matching, it is applicable to a wider range of practical applications than whole matching.

In this paper, we propose a novel method for processing shape-based subsequence retrieval. We first define an effective similarity model for shape-based subsequence retrieval and then present the indexing and query processing methods that support this model. To verify the superiority of the approach, we perform extensive experiments by using a variety of data sets. The results reveal that our approach successfully finds all the subsequences that have shapes similar to that of the query sequence, and also achieves high search performance.

This paper is organized as follows. Section 2 briefly reviews previous work related to similarity search. Section 3 defines the notation and terminology used in this paper and introduces our similarity model. Section 4 presents the indexing method for supporting the proposed similarity model, and Section 5 describes our query processing method. Section 6 presents the experimental results to show the superiority of our method, and finally, Section 7 summarizes and concludes the paper.

2. Related work

In this section, we briefly survey previous research results associated with similarity search in time-series databases.

Agrawal et al. (1993) proposed a method for whole matching in time-series databases. First, each data sequence of length n is transformed into a point in f (≪ n)-dimensional space by using the discrete Fourier transform (DFT). For indexing a large number of such points, an R*-tree (Beckmann et al., 1990) is used. For whole matching, a query sequence of length l is also transformed into a point in f-dimensional space in the same way, and the R*-tree is traversed to perform a range query by using the transformed query point. As a result, candidate sequences, which are highly likely to be the final answers, are found. Then, every candidate sequence is accessed from disk, and its actual distance to the query sequence is computed. If the distance is smaller than ε, the candidate sequence is returned as a final answer.

Faloutsos et al. (1994) and Moon et al. (2001) proposed methods for processing subsequence matching. They use the concept of a window, which is a fixed-sized subsequence inside the query and data sequences. Each window of length w is transformed into a window point in f (≪ w)-dimensional space by using the DFT or the wavelet transform. Such window points are indexed by an R*-tree (Beckmann et al., 1990). For subsequence matching, windows are extracted from a query sequence and are transformed into window points in f-dimensional space. For each window point, a range query is performed on the R*-tree to obtain candidate subsequences, each of which has a high possibility of being included in the final result. Finally, each candidate subsequence is accessed from disk, and its actual Euclidean distance to the query sequence is examined.

In some applications, similarity search based only on the Euclidean distance often fails to find the data sequences that are actually similar to a query sequence from the users' perspective. Therefore, to give flexibility to the definition of similarity, recent work tends to support various types of transformations.

Agrawal et al. (1995a), Chu and Wong (1999), Das et al. (1997), Goldin and Kanellakis (1995), Loh et al. (2000) and Rafiei (1999) proposed methods for similarity search that support normalization.
Normalization enables finding sequences that have a fluctuation pattern similar to that of a query sequence even though they are not close to each other before the normalization. Das et al. (1997) suggested a whole matching method that finds similar sequences by accessing an entire database. Also, Agrawal et al. (1995a) and Goldin and Kanellakis (1995) proposed methods that exploit R*-trees as indexes to improve the search performance significantly. While Agrawal et al. (1995a), Das et al. (1997) and Goldin and Kanellakis (1995) dealt only with whole matching, Chu and Wong (1999) and Loh et al. (2000) extended their ideas to process subsequence matching effectively by using indexes.

Loh et al. (2001), Rafiei and Mendelzon (1997) and Rafiei (1999) dealt with methods for similarity search that support the moving average transformation. The moving average transformation converts a given data sequence into a new sequence consisting of the averages of k consecutive values in the data sequence, where k is called the moving average coefficient. The moving average transformation is very useful for finding the trend of time-series data by reducing the effect of the noise inside. Rafiei and Mendelzon (1997) proposed a whole matching method that employs the convolution definition (Preparata and Shamos, 1985) to support moving average transformations of arbitrary coefficient using only one index. Loh et al. (2000) pointed out the problems in applying the previous methods to subsequence matching, and suggested a method that performs subsequence matching effectively by using the concept of index interpolation.

Berndt and Clifford (1996), Kim et al. (2001), Park et al. (2000, 2001), Rafiei and Mendelzon (1997), and Yi et al. (1998) addressed the issue of supporting time warping within the similarity model. Time warping is a transformation that allows any sequence element to replicate itself as many times as needed, and is useful in situations where sequences are of varying lengths and so their similarity cannot be directly computed by applying the Lp distance function. Berndt and Clifford (1996) and Yi et al. (1998) proposed whole matching methods, which access and examine every sequence in the database. Kim et al. (2001), Park et al. (2000), and Yi et al. (1998) discussed approaches that exploit indexes to perform whole matching efficiently. Also, Park et al. (2001) extended the basic idea of Kim et al. (2001) to process subsequence matching effectively by using the concept of prefix-querying.

As stated earlier, most previous approaches support only one transformation in their similarity model. In time-series database applications, however, it is frequently required to retrieve data (sub)sequences whose shapes are similar to that of a query sequence regardless of their actual element values. In this paper, we define this kind of problem as shape-based retrieval.

Agrawal et al. (1995b) allowed a user to specify a required query pattern using their shape definition language (SDL). Given a query pattern defined by SDL, they utilized a hierarchical index structure to retrieve the subsequences satisfying the query pattern. This method converts numeric elements into their corresponding symbols and compares the symbol sequences without considering their original element values. Two numeric elements whose values are very close to each other could be converted into different symbols in this method. Therefore, this method may judge quite similar sequences to be dissimilar.¹

Perng et al. (2000) proposed the landmark model for shape-based pattern matching. The landmark model extracts landmarks from the sequences and compares any two sequences using their landmarks without examining their original element values. The landmark model makes it possible to find sequences of similar shapes intuitively. However, the landmarks are highly dependent on target applications, thus making it complicated to find the landmarks suitable for a specific application.

3. Problem definition

This section formally defines the problem we are going to solve. Section 3.1 defines the notation and terminology used in this paper. Section 3.2 describes our similarity model.

3.1. Notation and terminology

Table 1 defines the notation frequently used in this paper.

Table 1
Notation

Notation      Description
S = (s[i])    Data sequence (0 ≤ i < Len(S))
Len(S)        Number of elements in S
X = (x[i])    Arbitrary subsequence of S (0 ≤ i < Len(X) ≤ Len(S))
Q = (q[i])    Query sequence (0 ≤ i < Len(Q))
ε             Distance tolerance
First(S)      The first element of S, s[0]
Rest(S)       Sequence composed of all the elements of S except for the first element, (s[1], s[2], ..., s[Len(S) − 1])
Max(S)        The largest element of S
Min(S)        The smallest element of S
()            Empty sequence
Max(a, b)     The largest value between a and b
Min(a, b, c)  The smallest value among a, b, and c

A data sequence is a sequence stored in a database, and a query sequence is a sequence submitted for querying.

¹ By the categorization stated in Section 4, our method would also convert similar values into different symbols. Unlike the method in Agrawal et al. (1995b), which only compares two symbolized sequences without considering their values, our method makes value-based comparisons during index traversal by using a lower-bound distance function Dtw-lb. Thus, our method does not make such a wrong judgment due to the symbolization.
Definition 1. Given a sequence S = (s[i]) (0 ≤ i < Len(S)), its normalized sequence Norm(S) = (s′[i]) (0 ≤ i < Len(S)) is defined as follows (Agrawal et al., 1995a)²:

s'[i] = \frac{s[i] - (Max(S) + Min(S))/2}{(Max(S) - Min(S))/2}.

² We may use a different type of normalization (Goldin and Kanellakis, 1995; Rafiei and Mendelzon, 1997) that first computes the average avg(S) and the standard deviation std(S) of S, and then replaces each element s[i] with s′[i] = (s[i] − avg(S))/std(S). However, we employ the normalization defined in Definition 1 in order to make all elements take values within the range [−1.0, 1.0]. This enables high compression of the subsequence tree, which is defined in Section 4. Compared with using avg(S) and std(S), the normalization defined in Definition 1 tends to be sensitive to noises embedded in sequences. As mentioned in Section 3.2, however, our similarity model first eliminates such noises in each sequence through the moving average transformation before performing the normalization transformation. Thus, we can safely employ Min(S) and Max(S) instead of avg(S) and std(S) without worrying about the noise effect.
Normalization, a combination of shifting and scaling, is a transformation that reduces the effect of absolute element values. Therefore, it is useful in finding sequences with similar changing patterns even though their absolute values may be different. For example, consider the two sequences S1 and S2 in Fig. 1. Although S1 and S2 have different element values, we see that their changing patterns are quite similar. By normalization, they are transformed into the identical sequences Norm(S1) and Norm(S2).

[Figure: plot of S1, S2, Norm(S1), and Norm(S2); element value vs. element number.]
Fig. 1. Example of normalization transformation.
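As an illustration, the following is a minimal Python sketch of Definition 1; the function name normalize and the list-based representation of sequences are our own choices, not identifiers from the paper.

def normalize(s):
    """Normalize a sequence into [-1.0, 1.0] following Definition 1.
    Assumes s is non-constant, so the value range has nonzero width."""
    mid = (max(s) + min(s)) / 2.0    # center of the value range
    half = (max(s) - min(s)) / 2.0   # half-width of the value range
    return [(v - mid) / half for v in s]

# Sequences with different absolute values but the same shape are
# mapped to the identical normalized sequence (cf. Fig. 1):
print(normalize([2.0, 4.0, 2.0]))    # [-1.0, 1.0, -1.0]
print(normalize([-3.0, 1.0, -3.0]))  # [-1.0, 1.0, -1.0]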
Definition 2. Given a sequence S = (s[i]) (0 ≤ i < Len(S)) and a moving average coefficient k, S is transformed into MVk(S) = (sk[j]) (0 ≤ j < Len(S) − k + 1) by the k-moving average transformation (Chatfield, 1984; Kendall, 1979):

s_k[j] = \frac{s[j] + s[j+1] + \cdots + s[j+k-1]}{k} = \frac{1}{k} \sum_{l=j}^{j+k-1} s[l].
As described in Definition 2, the k-moving average transformation generates a series of elements from the average values of k successive elements of an original sequence. The moving average transformation reduces the effect of noises embedded in sequences. Therefore, it is useful for detecting sequences of similar changing patterns without worrying about noises. Users decide the moving average coefficient k according to how much they want to reduce the effect of noises. For example, Fig. 2 shows a sequence S and its 4- and 8-moving averaged sequences MV4(S) and MV8(S). We see that the effect of noises is reduced as the moving average coefficient k increases.

[Figure: plot of S, MV4(S), and MV8(S); element value vs. element number.]
Fig. 2. Example of moving average transformation.

Definition 3. Given two sequences S and Q of the same length n, the distance function Lp is defined as follows. L1 is the Manhattan distance, L2 is the Euclidean distance, and L∞ is the maximum distance over all pairs of elements (Shim et al., 1997).

L_p(S, Q) = \left( \sum_{i=1}^{n} |s[i] - q[i]|^p \right)^{1/p}, \quad 1 \le p \le \infty.
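A sketch of Definitions 2 and 3 in the same hypothetical Python style as above; passing p = float('inf') selects the maximum distance L∞.

def moving_average(s, k):
    """k-moving average of Definition 2: element j of MVk(S)
    averages s[j] .. s[j+k-1]."""
    return [sum(s[j:j + k]) / k for j in range(len(s) - k + 1)]

def lp_distance(s, q, p):
    """Lp distance of Definition 3 for two equal-length sequences."""
    diffs = [abs(a - b) for a, b in zip(s, q)]
    if p == float('inf'):
        return max(diffs)     # L-infinity: maximum elementwise distance
    return sum(d ** p for d in diffs) ** (1.0 / p)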
Even though the distance function Lp is widely used in many applications, it has the strict restriction that the two sequences to be compared should be of the same length. Time warping is a transformation that allows any sequence element to replicate itself as many times as needed without extra cost (Yi et al., 1998). For example, the two sequences S = (20, 21, 21, 20, 20, 23, 23, 23) and Q = (20, 20, 21, 20, 23) can be identically transformed into (20, 20, 21, 21, 20, 20, 23, 23, 23) by time warping. The time warping distance is defined as the smallest distance between two sequences transformed by time warping.

Definition 4. Given two sequences S and Q, the time warping distance Dtw is defined recursively as follows
(Rabiner and Juang, 1993). Here, Dbase can be any Lp function that returns the distance between two elements:

D_{tw}((), ()) = 0,
D_{tw}(S, ()) = D_{tw}((), Q) = \infty,
D_{tw}(S, Q) = \left[ D_{base}(First(S), First(Q))^p + \min\{ D_{tw}(S, Rest(Q)),\ D_{tw}(Rest(S), Q),\ D_{tw}(Rest(S), Rest(Q)) \}^p \right]^{1/p}.
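The recursion in Definition 4 can be evaluated by dynamic programming over suffixes. The sketch below is our own memoized Python rendering, with p = 2 and Dbase(a, b) = |a − b| as example choices; it is not code from the paper.

from functools import lru_cache

def dtw_distance(s, q, p=2.0):
    """Time warping distance of Definition 4, memoized over
    suffix start positions (i, j) of s and q."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == len(s) and j == len(q):
            return 0.0                       # both sequences exhausted
        if i == len(s) or j == len(q):
            return float('inf')              # only one sequence exhausted
        base = abs(s[i] - q[j])              # Dbase: element distance
        best = min(d(i, j + 1),              # replicate s[i]
                   d(i + 1, j),              # replicate q[j]
                   d(i + 1, j + 1))          # advance both
        return (base ** p + best ** p) ** (1.0 / p)
    return d(0, 0)

# Sequences of different lengths can still be compared:
print(dtw_distance((20, 21, 21, 20, 20, 23, 23, 23), (20, 20, 21, 20, 23)))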
While the Euclidean distance can be used only when the two compared sequences are of the same length, the time warping distance can be applied to any two sequences of arbitrary lengths. Therefore, the time warping distance is smoothly applicable to databases where sequences are of different lengths.

3.2. Similarity model

The goal of this work is to devise a method that effectively finds the subsequences whose shapes are similar to that of a query sequence. To support shape-based retrieval, this paper defines the following similarity model, which combines the shifting, scaling, moving average, and time warping transformations.

Definition 5. Given two sequences or subsequences S and Q, their distance (or dissimilarity) is defined as follows:

D(S, Q) = D_{tw}(Norm(MV_k(S)), Norm(MV_k(Q))).

As in Definition 5, the distance of S and Q is defined as the time warping distance of the two sequences converted by (1) the k-moving average transformation and then (2) the normalization transformation.
by (1) k-moving average transformation, and then (2)
construction steps in detail, and Section 4.4 presents
normalization transformation.
the technique for index compression.
The shape-based retrieval based on our similarity
model supports shifting, scaling, and time warping 4.1. Suffix tree
transforms, and also minimizes the effect of noises owing
to moving average transformation. In particular, when A trie is a data structure for indexing a set of key-
computing the time warping distance between the two words of varying sizes. A suffix tree (Stephen, 1994) is
transformed sequences, we provide three types of Lp a trie whose set of keywords comprises the suffixes of
(the Manhattan distance L1, the Euclidean distance L2, a single sequence. Nodes with a single outgoing edge
and the maximum distance L1) as a base distance func- can be collapsed, yielding the structure known as the
tion. These distance functions have the following suffix tree (Stephen, 1994). A suffix tree is generalized
features (Agrawal et al., 1993; Sidiropoulos and Bros, to allow multiple sequences to be stored in the same tree,
1999; Yi and Faloutsos, 2000). and is useful to find the subsequences exactly matched
It has been known that L1 is optimal when errors are with a query sequence. A suffix tree does not assume
additive, i.i.d. (independent, identically distributed) any distance function for its construction. Therefore, it
Laplacian (or double exponential). Thus, it is more ro- can support various distance functions at query process-
bust against impulsive noise. L2 is optimal in the maxi- ing time.
mum likelihood sense when errors are additive, i.i.d. Each suffix of a sequence is represented by a leaf
Gaussian. It has been the most popular dissimilarity node. For example, given sequence Si = (s[0], s[1], . . .,
measure in time-series applications. L1 has an advan- s[Len(Si) 1]), its suffix (s[j], s[j + 1], . . ., s[Len(Si) 1])
tage that users can easily specify a tolerance without is represented by the leaf node labeled with (Si, j). The
considering the length of a query sequence. It has a dis- edges are labeled with the subsequences such that
The edges are labeled with subsequences such that the concatenation of the edge labels on the path from the root to the leaf labeled with (Si, j) becomes the suffix (s[j], s[j + 1], ..., s[Len(Si) − 1]). The concatenation of the edge labels on the path from the root to an internal node N represents the longest common prefix of the suffixes represented by the leaf nodes under N. Fig. 3 shows the suffix tree constructed from the two sequences S1 = (1.1, 2.5, 2.5, 1.1) and S2 = (1.1, 2.5). Four suffixes ((1.1, 2.5, 2.5, 1.1), (2.5, 2.5, 1.1), (2.5, 1.1), and (1.1)) from S1 and two suffixes ((1.1, 2.5) and (2.5)) from S2 are extracted and then inserted into the suffix tree. The symbol $ is used as the end marker of a suffix.

[Figure: a suffix tree whose edges are labeled with 1.1, 2.5, and $, and whose leaves are labeled (S1,1), (S2,1), (S1,4), (S1,2), (S1,3), (S2,2).]
Fig. 3. Suffix tree from S1 = (1.1, 2.5, 2.5, 1.1) and S2 = (1.1, 2.5).
The suffix tree becomes more compact when the suffixes contain more and longer common prefixes. However, it is rare for the suffixes to have common prefixes because every element takes values from a continuous domain. Park et al. (2000) proposed to use categorization to solve this problem. Categorization divides the entire range from which elements take their values into multiple non-overlapping subranges. Then, every element is converted into the symbol of its corresponding subrange. If we build the suffix tree from sequences of symbols, the suffixes are much more likely to have common prefixes than before.
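A sketch of categorization for normalized sequences; the equal-width subranges over [−1.0, 1.0] are our own simplifying assumption for illustration — the paper chooses categories by examining the distribution of element values (see Step 4 in Section 4.3).

def categorize(norm_seq, num_categories):
    """Map each normalized element in [-1.0, 1.0] to the integer symbol
    of its non-overlapping subrange (equal-width bins assumed here)."""
    width = 2.0 / num_categories
    symbols = []
    for v in norm_seq:
        idx = int((v + 1.0) / width)
        symbols.append(min(idx, num_categories - 1))  # clamp v == 1.0
    return symbols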
4.2. Indexing strategy

Consider a suffix Sa of a sequence S and a prefix Sb of Sa. Since Sb is a prefix of Sa, every element of Sb is contained in Sa. Therefore, we can obtain the distance between Sb and Q while we compute the time warping distance between Sa and Q. Thus, if we do not consider normalization, the index constructed from the suffixes is enough for retrieving any subsequences.

However, our similarity model supports normalization. As indicated in Definition 1, we use Max(S) and Min(S) to normalize a (sub)sequence S. Let us consider the suffix Sa and its prefix Sb again. Since Max(Sa) and Min(Sa) can be different from Max(Sb) and Min(Sb), we cannot guarantee that Norm(Sb) is a prefix of Norm(Sa). Therefore, the distance between Norm(Sb) and Q is no longer a by-product of the distance computation between Norm(Sa) and Q. As a result, we have to store every possible subsequence in the tree in order to support our similarity model safely. In this paper, we call the tree that has the same structure as the suffix tree but stores every possible subsequence the subsequence tree.

While the suffix tree stores n suffixes of a sequence with n elements, the subsequence tree stores n(n + 1)/2 subsequences. However, the subsequence tree does not grow excessively, because the subsequences from the same sequence have a high possibility of sharing common prefixes.
4.3. Construction of the subsequence tree

To build a subsequence tree from a time-series database, we take the following five steps; a sketch of the pipeline follows the list.

• Step 1: Moving average transformation. After selecting the moving average coefficient k suitable for a given target application, we perform the k-moving average transformation for every sequence stored in the database. MVk(S) denotes the sequence converted from S by the k-moving average transformation.
• Step 2: Subsequence extraction. We extract every possible subsequence X from MVk(S). Note that X is not a subsequence of S but a subsequence of MVk(S). At this step, we can determine the minimum length L of the subsequences to be extracted. That is, when it is meaningless to retrieve too short subsequences as answers, we extract only the subsequences whose lengths are longer than or equal to L. This contributes towards reducing the size of the subsequence tree.
• Step 3: Normalization transformation. We normalize every subsequence X using Max(X) and Min(X). Norm(X) denotes the normalized sequence of X.
• Step 4: Symbolization using categorization. We first decide a set of categories (subranges) by examining the distribution of element values in the database after Step 3, and assign a unique symbol to each category. Then, we convert each element of Norm(X) into the symbol of the corresponding category. The number of categories depends on the target application (Park et al., 2000).
• Step 5: Tree construction. We build a subsequence tree from the set of symbolized subsequences. Each symbolized subsequence is represented by the path from the root to a leaf node. The identifier (SID, i, j) of a symbolized subsequence is stored in the corresponding leaf node, where SID is the identifier of the sequence from which the symbolized subsequence has been obtained, i is its starting offset, and j is its ending offset. Since the subsequence tree is possibly constructed from a large volume of data, we employ the disk-based algorithm (Park et al., 2000) that reduces the number of disk accesses.
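The following sketch strings Steps 1–4 together for one data sequence, reusing the hypothetical helpers introduced earlier; the paper's actual disk-based tree construction (Step 5) is considerably more involved and is not reproduced here.

def index_entries(seq, sid, k, min_len, num_categories):
    """Steps 1-4 for one data sequence: moving average, subsequence
    extraction, normalization, and symbolization. Yields
    (symbol_sequence, (sid, start, end)) pairs for Step 5 to insert."""
    mv = moving_average(seq, k)                       # Step 1
    for i in range(len(mv)):                          # Step 2
        for j in range(i + min_len - 1, len(mv)):
            x = mv[i:j + 1]
            if max(x) == min(x):
                continue  # constant subsequences cannot be normalized
            norm_x = normalize(x)                     # Step 3
            symbols = categorize(norm_x, num_categories)  # Step 4
            yield symbols, (sid, i, j)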
4.4. Index compression

The subsequence tree does not grow excessively even when the database becomes large. This is because the subsequence tree has a large potential to compress the input subsequences. However, it is still true that a subsequence tree built from fewer subsequences is smaller than one built from more subsequences. This subsection presents a technique that reduces the number of subsequences to be stored in the subsequence tree without losing any information necessary for query processing.

Let us consider two subsequences Sa and Sb where Sb is a prefix of Sa. Norm(Sb) is still a prefix of Norm(Sa) when Max(Sa) = Max(Sb) and Min(Sa) = Min(Sb). Therefore, given a query sequence Q, the distance between Norm(Sb) and Q can be obtained as a by-product of the distance computation between Norm(Sa) and Q. This implies that Sb does not have to be inserted into the subsequence tree as long as Sa is stored in the tree. Let S[i:j] denote the subsequence of S including the elements in positions i through j. The procedure to determine whether S[i:j] is to be inserted into the tree is formally defined as follows: (1) insert S[i:j] if j = Len(S) − 1 (that is, there is no subsequence which contains S[i:j] as a prefix); (2) insert S[i:j] if Max(S[i:j]) ≠ Max(S[i:j+1]) or Min(S[i:j]) ≠ Min(S[i:j+1]). We call the tree that stores the set of subsequences passing the above test the compact subsequence tree.
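A minimal sketch of the insertion test, assuming sequences are Python lists; should_insert is our own name for it.

def should_insert(s, i, j):
    """Compression rule of Section 4.4: keep S[i:j] only when no
    extension S[i:j+1] would yield its normalization for free."""
    if j == len(s) - 1:
        return True                      # rule (1): no longer extension exists
    cur, ext = s[i:j + 1], s[i:j + 2]
    return max(cur) != max(ext) or min(cur) != min(ext)  # rule (2)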
5. Query processing

This section presents the query processing method for shape-based retrieval of similar subsequences, and shows its computational complexity.

5.1. Algorithm

We premise that users submit a noise-free query sequence in either of two ways. The first way is to let users directly determine the element values of the query sequence. The query sequence thus obtained is free from noises since users draw the specific shape of the sequence that they want to find. The second way is to let users select a subsequence from a sequence chosen in the database and transform it into a k-moving averaged subsequence. In both ways, we are free from the noise effect, and thus safely regard both kinds of query sequences as k-moving averaged. As a next step, we normalize the query sequence, and then perform the following query processing algorithm using this query sequence Q.

Algorithm 1. Similarity search algorithm using the compact subsequence tree: Search-CST.
Input: compact subsequence tree CST, query sequence Q, tolerance ε, base distance function Lp
Output: set of answers answerSet
1 candidateSet := VisitNode-and-FindAnswers-CST(rootNode(CST), Q, ε, emptyTable, Lp);
2 answerSet := PostProcess(candidateSet);
3 return answerSet;

Algorithm 1 shows Search-CST, which traverses the compact subsequence tree to retrieve the subsequences whose time warping distances Dtw from the query sequence Q are within the distance tolerance ε. Remember that we allow users to specify the desired Lp distance function at querying time; therefore, Lp is given to Algorithm 1 as one of the arguments. We note that the query sequence and the subsequences in the tree have been transformed by the k-moving average and normalization.

As described in Section 4.3, a compact subsequence tree is constructed from symbol sequences. Therefore, it is not possible to compute the exact time warping distance between the query sequence and a subsequence already converted into a symbol sequence. Instead, Search-CST uses the distance function Dtw-lb defined in Park et al. (2000) to compute a lower bound of the distance between the query sequence and a symbolized sequence CS.

Definition 6. Given query sequence Q and symbolized sequence CS, the lower-bound time warping distance function Dtw-lb(CS, Q) is defined as follows:

D_{tw-lb}((), ()) = 0,
D_{tw-lb}(CS, ()) = D_{tw-lb}((), Q) = \infty,
D_{tw-lb}(CS, Q) = \left[ D_{base-lb}(First(CS), First(Q))^p + \min\{ D_{tw-lb}(CS, Rest(Q)),\ D_{tw-lb}(Rest(CS), Q),\ D_{tw-lb}(Rest(CS), Rest(Q)) \}^p \right]^{1/p},

D_{base-lb}(A, b) = 0          if A.lb \le b \le A.ub
D_{base-lb}(A, b) = b - A.ub   if b > A.ub
D_{base-lb}(A, b) = A.lb - b   if b < A.lb.

Here, A is the symbol of First(CS) and b is the actual numeric value of First(Q). A.lb and A.ub denote the minimum and maximum element values of the subrange corresponding to the symbol A.
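A sketch of the element-level lower bound Dbase-lb, under the assumption that a symbol is represented by its subrange bounds (lb, ub); the full Dtw-lb recursion then mirrors dtw_distance above with this base function substituted.

def dbase_lb(symbol, b):
    """Definition 6: distance from value b to the nearest value inside
    the symbol's subrange [lb, ub]; zero if b falls inside it."""
    lb, ub = symbol
    if b > ub:
        return b - ub
    if b < lb:
        return lb - b
    return 0.0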
Algorithm Search-CST calls the function VisitNode-and-FindAnswers-CST (line 1) to retrieve the candidate subsequences whose lower-bound time warping distances from Q are within ε. Let us consider the function VisitNode-and-FindAnswers-CST shown in Algorithm 2. When the algorithm visits a node N, it inspects each child node to find new candidates and determines whether it is necessary to go further down the tree.

Algorithm 2. Algorithm for traversing the compact subsequence tree: VisitNode-and-FindAnswers-CST. [The pseudocode is given as a figure in the original; its steps are described below, and a reconstruction sketch follows this discussion.]

For example, suppose that the algorithm stays on node N and inspects its child node CNi. For simpler explanation, we assume that each edge is labeled by a single symbol. The algorithm builds the cumulative distance table between Q and label(N, CNi), locating Q on the X-axis and label(N, CNi) on the Y-axis. If N is the root node, the table is built from the bottom. Otherwise, the table is constructed by augmenting a new row onto the table that has been accumulated from the root to N. The function AddRow (line 3) adds a new row using the distance function Dtw-lb with the given Lp as the base distance function. The first step is to inspect the last column of the newly added row to find candidates (line 4). If the last column has a value not larger than ε (line 5), then the algorithm extracts the identifiers from the leaf nodes under CNi and adds them to the candidate set (line 6).

Note that there may be some data subsequences not stored in the compact subsequence tree. These subsequences are embedded in the paths from the root to internal nodes. Therefore, the paths to internal nodes can also generate candidate answers. Suppose that the path from the root to the node CNi has a lower-bound distance from Q not larger than ε. Then, all the subsequences represented by this path can be easily identified by extracting the leaf nodes under CNi. Let LN be one of those leaf nodes. When LN has an identifier (Sk, j, j′) and the length of the path from the root to CNi is l, one of the subsequences identified by this path is (Sk, j, j + l − 1).

The next step is to determine whether it is necessary to go further down the tree. That is, if at least one column of the newly added row has a value not greater than ε (line 8), the search continues down the tree to find more candidates. Otherwise, the search moves on to the next child of N. When it is necessary to go further down the tree, the algorithm calls itself recursively (line 9).

Since VisitNode-and-FindAnswers-CST finds candidates using the lower-bound distance function Dtw-lb, subsequences whose actual distances are larger than ε may be contained in the candidate set. Those subsequences are called false alarms (Faloutsos et al., 1994). Therefore, Search-CST requires a post-processing step to detect and discard false alarms. For each answer in the candidate set, the function PostProcess (line 2) retrieves the corresponding subsequence, transforms it using the k-moving average and normalization, and then computes its actual distance from Q using the original time warping distance function Dtw. The subsequences whose actual time warping distances from Q are not larger than ε are returned to the user as final answers (line 3).
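Since the original pseudocode of Algorithm 2 appears only as a figure, the following is our Python reconstruction from the description above — a sketch under stated assumptions, not the authors' code. It assumes one symbol per edge, as in the explanation, and the helpers add_row (extends the cumulative Dtw-lb table by one row), edge_symbol, and leaf_ids are hypothetical names.

def visit_node(node, q, eps, table, lp):
    """Reconstruction of VisitNode-and-FindAnswers-CST: depth-first
    traversal of the compact subsequence tree using Dtw-lb.
    'table' holds the cumulative distance rows built from the root."""
    candidates = set()
    for child in node.children():
        row = add_row(table, edge_symbol(node, child), q, lp)   # line 3
        if row[-1] <= eps:                      # lines 4-5: path is close enough
            candidates.update(leaf_ids(child))  # line 6: collect identifiers
        if min(row) <= eps:                     # line 8: longer matches possible
            candidates |= visit_node(child, q, eps, table + [row], lp)  # line 9
    return candidates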
5.2. Algorithm analysis

Before analyzing the complexity of Search-CST, let us examine the complexities of Seq-Scan and Search-ST. Seq-Scan is the sequential scan method, and Search-ST is the method using a subsequence tree as an index.

Seq-Scan reads each data sequence S, performs the k-moving average transformation on S to get MVk(S), extracts every possible subsequence X from MVk(S), and normalizes every X. Given a query sequence Q and a subsequence X, it builds the cumulative distance table with a computation complexity of O(|Q| · |X|) (Berndt and Clifford, 1996). For M k-moving averaged sequences whose average length is L, there are ML(L + 1)/2 subsequences, and their average length is (L + 2)/3. Therefore, the computation complexity of Seq-Scan is O(ML^3 |Q|).

Search-ST is computationally much cheaper than Seq-Scan due to branch-pruning and the sharing of common cumulative distance tables among all the subsequences that have common prefixes. Thus, the complexity of Search-ST is O(ML^3 |Q| / (R_d R_p) + nL|Q|). The left term is for the tree traversal and the right term is for the post-processing. Here, R_d (≥ 1) is the reduction factor due to the sharing of the cumulative distance tables, R_p (≥ 1) is the reduction factor gained from the branch-pruning, and n is the number of candidate subsequences that require post-processing.

The complexity of Search-CST is easily derived from that of Search-ST. The compact subsequence tree can be traversed faster than the corresponding subsequence tree because the former is usually smaller than the latter. However, the subsequence tree has a larger reduction factor R_d due to a higher possibility of sharing common prefixes. Let ρ (0 < ρ ≤ 1) be the compression ratio of the compact subsequence tree, i.e., ρ = (number of stored subsequences)/(number of total subsequences). Then the complexity of Search-CST is O(ρ ML^3 |Q| / (R'_d R_p) + nL|Q|).

6. Performance evaluation

This section presents the experimental results for the performance evaluation of the proposed method. Section 6.1 describes the environment for the experiments, and Section 6.2 shows and analyzes the experimental results.

6.1. Environment

All the parameter values for our experiments were selected to simulate the behavior of real-world stock prices and of query patterns issued for investment. Two kinds of data sets were used for the experiments: a synthetic data set and a real-world stock data set. Each synthetic data sequence S = (s1, s2, ..., sn) was generated by the following random walk expression:

s_i = s_{i-1} + z_i.

Here, zi is an independent, identically distributed (i.i.d.) random variable that takes values in the range [−0.1, 0.1]. The value of the first element s1 was taken randomly within the range [1, 10]. Stock data sequences were extracted from the USA S&P 500, and their element values were based on daily closing prices. As mentioned in Section 5, query sequences were randomly selected from those obtained by k-moving averaging such data sequences.
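As a small illustration of the synthetic data generation just described — a sketch using the paper's stated ranges; the function name and use of Python's random module are our own choices.

import random

def random_walk_sequence(n):
    """Synthetic sequence of Section 6.1: s1 uniform in [1, 10],
    then s_i = s_{i-1} + z_i with z_i uniform in [-0.1, 0.1]."""
    s = [random.uniform(1.0, 10.0)]
    for _ in range(n - 1):
        s.append(s[-1] + random.uniform(-0.1, 0.1))
    return s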
We have conducted the performance evaluation on three different approaches: Search-CST, Search-ST, and Seq-Scan. Search-CST represents our approach that uses the compact subsequence tree and utilizes a single index structure for all Lp (p = 1, 2, ∞) distance functions. Search-CST-L1, Search-CST-L2, and Search-CST-L∞ represent our approaches that employ the L1, L2, and L∞ distance functions, respectively. Search-ST also represents our approach, but using just the subsequence tree. Search-ST-L1, Search-ST-L2, and Search-ST-L∞ represent the Search-ST methods that employ L1, L2, and L∞, respectively. Finally, Seq-Scan is a naive sequential scan method, and Seq-Scan-L1, Seq-Scan-L2, and Seq-Scan-L∞ represent the Seq-Scan methods that employ L1, L2, and L∞, respectively.

The hardware platform for the experiments is a Sun UltraSparc-10 workstation equipped with 512 MB RAM. The software platform is Solaris 7, which was set to single-user mode to minimize the interference from other system and user processes.

6.2. Results and analyses

In Experiment 1, we evaluated the effectiveness and performance of our similarity model with the S&P 500 stock data set. We selected 1000 sequences of average length 100 from the data set and applied the 10-moving average transformation. As a query sequence, we used a subsequence of the double bottom pattern, which is frequently used in stock data analyses. The values of the distance tolerance ε were determined to retrieve 100, 200, 300, and 1000 different answers, respectively.

Fig. 4 shows some instances of the double bottom pattern retrieved by our similarity method. In each figure, the dotted line represents an actual sequence in the data set, and the plain line represents its 10-moving averaged sequence. Fig. 4(a) represents the sequence selected for querying. We extracted a double bottom pattern from that sequence and used it as a query sequence. The remaining figures represent the answers obtained from the data set. The subsequences represented by bold lines in Fig. 4(b)–(d) are some answers retrieved by our methods that employ L1, L2, and L∞, respectively.

We see that all the answers contain the double bottom pattern whose shape is quite similar to that of the query sequence even though their actual element values vary widely. These answers also illustrate the characteristics of the different distance functions well. We believe that the choice of an appropriate distance function for similar subsequence matching is highly application dependent and up to application engineers.

As the proposed similarity model supports the moving average and time warping transformations, the answers retrieved by our methods using the L1, L2, and L∞ distance functions have many common subsequences. For example, Fig. 4(e) depicts common answers retrieved by our three methods employing the L1, L2, and L∞ distance functions. Fig. 5 shows the results of an experiment that compares the answers retrieved by our methods. The stock data sequences were used as the data set, and the distance tolerance ε was determined to retrieve 100 different answers. The Venn diagram represents the percentages of common answers retrieved by two or three methods. The experimental results with more answers were similar to those in Fig. 5.
Fig. 4. Examples of shape-based retrieval of similar subsequences: (a) query sequence, (b) L1-based, (c) L2-based, (d) L∞-based, and (e) L1-, L2-, and L∞-based.

Fig. 5. Percentages of common answers retrieved by our methods using the L1, L2, and L∞ distance functions.

In Experiment 2, we compared the time and space efficiency of the proposed approaches. We selected 200, 400, 600, 800, and 1000 sequences of average length 100 from the stock data set. To construct the index, we set the number of categories (C) to 60. As C increases, the index size also increases, but the query processing time decreases. However, when C exceeds a certain threshold value, the index size and the query processing time become nearly constant. We regard this threshold value as the optimal number of categories. In our experiments, we chose 60 as the optimal number of categories, and thus set the number of categories to 60 for all the following experiments.

Table 2 shows the sizes of the proposed indices. We observe that the sizes of Search-ST and Search-CST increase linearly as the number of sequences increases. In comparison with Search-ST, the numbers of internal nodes, edges, and leaf nodes of Search-CST are reduced by approximately 50%. However, because the edges of the compact subsequence tree are longer than those of the corresponding subsequence tree, Search-CST actually saves about 36% of the storage space.

Fig. 6 shows the average query processing time of the three methods with various values of ε. We selected 200 sequences of average length 100 from the stock data set. The average length of the query sequences was set to 20. The values of ε were determined to retrieve 10, 30, 100, and 300 answers, respectively.

As expected, the average query processing times of Search-CST and Search-ST are almost the same for any value of p in the Lp-based distance function. But Search-CST performs much better than Seq-Scan. For example, when the L1-based time-warping distance is used, Search-CST-L1 performs better than Seq-Scan-L1 by about 50–117 times over the different values of ε. Search-CST-L2 performs better than Seq-Scan-L2 by about 26–66 times, and Search-CST-L∞ performs better than Seq-Scan-L∞ by about 13–23 times.

As the values of ε in L1 and L2 must be larger than those of L∞ to retrieve the same number of answers, the query processing times of Search-CST-L1 and Search-CST-L2 are longer than that of Search-CST-L∞.
Table 2
The size of the proposed indices with the increasing number of sequences

Number of sequences  Search algorithm  Number of edges  Number of internal nodes  Number of leaf nodes  Index size (KBytes)
200                  Search-ST         997,806          997,805                   819,000               44,960
                     Search-CST        537,176          537,175                   350,300               28,483
400                  Search-ST         1,992,098        1,992,097                 1,638,000             90,534
                     Search-CST        1,098,474        1,098,473                 723,721               58,373
600                  Search-ST         3,000,515        3,000,514                 2,457,000             136,101
                     Search-CST        1,649,482        1,649,481                 1,081,029             87,433
800                  Search-ST         3,998,892        3,998,891                 3,276,000             181,170
                     Search-CST        2,191,398        2,191,397                 1,435,043             115,903
1000                 Search-ST         5,006,675        5,006,674                 4,095,000             227,021
                     Search-CST        2,727,899        2,727,898                 1,782,489             144,326

Fig. 6. Average query processing time with the increasing number of answers.

With the results of Experiment 2, we decided that Search-CST is better than Search-ST because it solves the problem of the index size while preserving good performance. Therefore, in the following experiments, we compared only Search-CST with the sequential scan method.

In Experiment 3, we compared the average query processing time of the two approaches while changing the average length of the query sequences. We selected 200 sequences of average length 100 from the stock data set. The values of ε were determined to retrieve about 100 answers. Fig. 7 shows the average query processing time of Search-CST and Seq-Scan with the increasing length of query sequences. The result shows that Search-CST performs better than Seq-Scan regardless of the query sequence length. As seen from this figure, the performance improvements of Search-CST are similar to those in Fig. 6 for any value of p in the Lp-based distance function.

Fig. 7. Average query processing time with the increasing average length of query sequences.

In Experiment 4, we compared the two approaches with an increasing number of sequences and an increasing average length of sequences, respectively. We used a large-volume synthetic data set for this experiment. The average length of the query sequences was set to 20. First, we fixed the length of the sequences at 100 and increased the number of sequences from 2000 to 10,000. Fig. 8 shows the query processing time of the two approaches with various numbers of sequences. The elapsed times of both approaches increase linearly as the number of sequences grows.

Fig. 8. Average query processing time with the increasing number of data sequences.
The value of ε was chosen to retrieve about 100 answers from 2000 data sequences. The result shows that Search-CST performs better than Seq-Scan regardless of the number of sequences. As seen from this figure, the performance improvements of Search-CST are similar to those in Fig. 6 for any value of p in the Lp-based distance function.

Then, we fixed the number of sequences at 100 and increased the length of the sequences from 200 to 1000. Fig. 9 shows the query processing times of the two approaches with changing average length of sequences. The value of ε was chosen to retrieve about 100 answers from data sequences of length 200. As shown in Fig. 9, while the elapsed time of Seq-Scan increases rapidly, that of Search-CST increases quite slowly. For example, when the L1-based time-warping distance function is used, Search-CST-L1 performs better than Seq-Scan-L1 by about 102–362 times. Search-CST-L2 performs better than Seq-Scan-L2 by about 61–390 times, and Search-CST-L∞ performs better than Seq-Scan-L∞ by about 31–253 times. The performance gain gets larger as the length of the sequences increases.

Fig. 9. Average query processing time with the increasing length of data sequences.

7. Conclusions

This paper discussed the problem of shape-based retrieval in time-series databases. It defined a new similarity model for shape-based subsequence retrieval, and also proposed the indexing and query processing methods for supporting this similarity model efficiently.

The proposed similarity model supports a combination of transformations such as shifting, scaling, moving average, and time warping, and allows users to choose an Lp distance function for computing the similarity between the two finally transformed sequences. Thus, it enables users to define the target results of a query depending on their preferences.

Also, we proposed the compact subsequence tree and a query processing method for efficient processing of shape-based retrieval without false dismissal. The compact subsequence tree is a compressed version of the disk-based subsequence tree. An important feature of our approach is that it supports our similarity model based on L1, L2, and L∞ with only one index structure.

To verify the superiority of our approach, we performed a series of experiments with the real-world S&P 500 stock data set and large synthetic data sets. The results reveal that our approach successfully finds all the subsequences that have shapes similar to that of the query sequence, and also achieves a speedup of several tens to several hundreds of times compared with the sequential scan method.

Acknowledgments

This work has been supported by the Korea Research Foundation with Grant KRF-2003-041-D00486, the IT Research Center via Kangwon National University, and the University Research Program (C1-2002-146-0-3) of IITA. Sang-Wook Kim would like to thank Jung-Hee Seo, Suk-Yeon Hwang, Grace (Joo-Young) Kim, and Joo-Sung Kim for their encouragement and support.

References

Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In: Proc. FODO. pp. 69–84.
Agrawal, R., Lin, K., Sawhney, H.S., Shim, K., 1995a. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In: Proc. VLDB. pp. 490–501.
Agrawal, R., Psaila, G., Wimmers, E.L., Zaït, M., 1995b. Querying shapes of histories. In: Proc. VLDB. pp. 502–514.
Beckmann, N., Kriegel, H., Schneider, R., Seeger, B., 1990. The R*-tree: an efficient and robust access method for points and rectangles. In: Proc. ACM SIGMOD. pp. 322–331.
Berndt, D.J., Clifford, J., 1996. Finding patterns in time series: a dynamic programming approach. In: Advances in Knowledge Discovery and Data Mining. AAAI/MIT, Cambridge, MA, pp. 229–248.
Chatfield, C., 1984. The Analysis of Time-series: an Introduction, third ed. Chapman and Hall, London.
Chen, M.S., Han, J., Yu, P.S., 1996. Data mining: an overview from a database perspective. IEEE TKDE 8 (6), 866–883.
Chu, K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: Proc. ACM PODS. pp. 237–248.
Das, G., Gunopulos, D., Mannila, H., 1997. Finding similar time series. In: Proc. PKDD. pp. 88–100.
Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching in time-series databases. In: Proc. ACM SIGMOD. pp. 419–429.
Goldin, D.Q., Kanellakis, P.C., 1995. On similarity queries for time-series data: constraint specification and implementation. In: Proc. Constraint Programming. pp. 137–153.
Kendall, M., 1979. Time-series, second ed. Charles Griffin and Company, London.
Kim, S.W., Park, S., Chu, W.W., 2001. An index-based approach for similarity search supporting time warping in large sequence databases. In: Proc. IEEE ICDE. pp. 607–614.
Loh, W.K., Kim, S.W., Whang, K.Y., 2000. Index interpolation: an approach for subsequence matching supporting normalization transform in time-series databases. In: Proc. ACM CIKM. pp. 480–487.
Loh, W.K., Kim, S.W., Whang, K.Y., 2001. Index interpolation: a subsequence matching algorithm supporting moving average transform of arbitrary order in time-series databases. IEICE Trans. Inf. Syst. E84-D (1), 76–86.
Moon, Y.S., Whang, K.Y., Loh, W.K., 2001. Duality-based subsequence matching in time-series databases. In: Proc. IEEE ICDE. pp. 263–272.
Park, S., Chu, W.W., Yoon, J., Hsu, C., 2000. Efficient searches for similar subsequences of different lengths in sequence databases. In: Proc. IEEE ICDE. pp. 23–32.
Park, S., Kim, S.W., Cho, J.S., Padmanabhan, S., 2001. Prefix-querying: an approach for effective subsequence matching under time warping in sequence databases. In: Proc. ACM CIKM. pp. 255–262.
Perng, C.S., Wang, H., Zhang, S.R., Parker, D.S., 2000. Landmarks: a new model for similarity-based pattern querying in time series databases. In: Proc. IEEE ICDE. pp. 33–42.
Preparata, F.P., Shamos, M., 1985. Computational Geometry: an Introduction. Springer-Verlag, Berlin.
Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.
Rafiei, D., 1999. On similarity-based queries for time series data. In: Proc. IEEE ICDE. pp. 410–417.
Rafiei, D., Mendelzon, A., 1997. Similarity-based queries for time-series data. In: Proc. ACM SIGMOD. pp. 13–24.
Shim, K., Srikant, R., Agrawal, R., 1997. High-dimensional similarity joins. In: Proc. IEEE ICDE. pp. 301–311.
Sidiropoulos, N.D., Bros, R., 1999. Mathematical programming algorithms for regression-based non-linear filtering in R^N. IEEE Trans. Signal Process. (Mar.).
Stephen, G.A., 1994. String Searching Algorithms. World Scientific Publishing, Singapore.
Yi, B.K., Faloutsos, C., 2000. Fast time sequence indexing for arbitrary Lp norms. In: Proc. VLDB. pp. 385–394.
Yi, B.K., Jagadish, H.V., Faloutsos, C., 1998. Efficient retrieval of similar time sequences under time warping. In: Proc. IEEE ICDE. pp. 201–208.