
Distance-based indexing for high-dimensional metric spaces

1997, ACM SIGMOD Record

https://doi.org/10.1145/253260.253345

Abstract

In many database applications, one of the common queries is to find approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the data domain is high-dimensional, or the distance function used to compute distances between data objects is non-Euclidean. In this paper, we introduce a distance-based index structure called the multi-vantage-point (mvp) tree for similarity queries on high-dimensional metric spaces. The mvp-tree uses more than one vantage point to partition the space into spherical cuts at each level. It also utilizes the pre-computed (at construction time) distances between the data points and the vantage points. We have done experiments to compare mvp-trees with vp-trees, which have a similar partitioning strategy but use only one vantage point at each level and do not make use of the pre-computed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions.

Tolga Bozkaya and Meral Ozsoyoglu
Department of Computer Engineering & Science, Case Western Reserve University
email: [email protected], [email protected]

This research is partially supported by the National Science Foundation grant IRI 92-24660, and the National Science Foundation FAW award IRI-90-24152.

1. Introduction

In many database applications, it is desirable to be able to answer queries based on proximity, such as asking for data items that are similar to a query item, or that are closest to a query item. We face such queries in the context of many database applications such as genetics, image/picture databases, time-series analysis, information retrieval, etc. In genetics, the concern is to find DNA or protein sequences that are similar in a genetic database. In time-series analysis, we would like to find similar patterns among a given collection of sequences. Image databases can be queried to find and retrieve images in the database that are similar to the query image with respect to a specified criterion.

Similarity between images can be measured in a number of ways. Features such as shape, color, and texture can be extracted from images in the database and used as the content information on which the distance calculations are based. Images can also be compared on a pixel-by-pixel basis, by calculating the distance between two images as the accumulation of the differences between the intensities of their pixels.

In all the applications above, the problem is to find data items that are similar to a given query item, where the similarity between items is computed by some distance function defined on the application domain. Our objective is to provide an efficient access mechanism to answer these similarity queries. In this paper, we consider applications where the data domain is high-dimensional and the distance function employed is metric. It is important for an application to have a metric distance function, since the triangle inequality then makes it possible to filter out distant data items for a similarity query (section 2). Because of the high dimensionality, the distance calculations between data items are assumed to be very expensive. Therefore, an efficient access mechanism certainly has to minimize the number of distance calculations for similarity queries to improve the speed in answering them. This is usually done by employing techniques and index structures that filter out distant (non-similar) data items quickly, avoiding an expensive distance computation for each of them.
The data items that are in the result of a similarity query can be further filtered by the user through visual browsing. This happens in image database applications, where the user would pick the images most semantically related to a query image by examining the images retrieved as the result of a similarity query. This is mostly inevitable, because it is impossible to extract and represent all the semantic information of an image simply by extracting features from it. The best an image database can do is to present the images that are related or close to the query image, and leave the further identification and semantic interpretation of the images to the users.

In this paper, we introduce the mvp-tree (multi-vantage-point tree) as a general solution to the problem of answering similarity-based queries efficiently for high-dimensional metric spaces. The mvp-tree is similar to the vp-tree (vantage-point tree) [Uhl91] in the sense that both structures use relative distances from a vantage point to partition the domain space. In vp-trees, at every node of the tree, a vantage point is chosen among the data points, and the distances of this vantage point from all other points (the points that will be indexed below that node) are computed. Then, these points are sorted into an ordered list with respect to their distances from the vantage point. Next, the list is partitioned to create sublists of equal cardinality. The order of the tree corresponds to the number of partitions to be made. Each of these partitions keeps the data points that fall into a spherical cut whose inner and outer radii are the minimum and the maximum distances of these points from the vantage point.

The mvp-tree behaves more cleverly in making use of the vantage points by employing more than one at each level of the tree to increase the fanout of each node. In vp-trees, for a given similarity query, most of the distance computations made are between the query point and the vantage points. Because it uses more than one vantage point in a node, the mvp-tree has fewer vantage points compared to a vp-tree. The distances of the data points at the leaf nodes from the vantage points at higher levels (which were already computed at construction time) are kept in mvp-trees, and these distances are used for efficient filtering at search time. The filtering at the leaf level is exploited further by giving the leaf nodes higher node capacities. In this way, the major filtering step during search is delayed to the leaf level.

We have done experiments with 20-dimensional Euclidean vectors and gray-level images to compare vp-trees and mvp-trees and to demonstrate the mvp-trees' efficiency. The distance distribution of the data points plays an important role in the efficiency of the index structures, so we experimented on two sets of Euclidean vectors with different distance distributions. In both cases, mvp-trees made 20% to 80% fewer distance computations than vp-trees for small query ranges. For higher query ranges, the percentage difference decreased gradually, yet the mvp-trees still performed 10% to a respectable 30% fewer distance computations for the largest query ranges we used in our experiments.

Our experiments on gray-level images using L1 and L2 metrics (see section 5.1) also revealed that mvp-trees perform better than vp-trees. For this data set, we had only 1151 images to experiment on (and therefore rather shallow trees), and the mvp-trees performed up to 20-30% fewer distance computations.

The rest of the paper is organized as follows. Section 2 gives the definitions for high-dimensional metric spaces and similarity queries. Section 3 presents the problem of indexing in high-dimensional spaces and also presents previous approaches to this problem; the related work on distance-based index structures for answering similarity-based queries is also given in section 3. Section 4 introduces the mvp-tree structure. The experimental results comparing the mvp-trees with vp-trees are given in section 5. We summarize our results and point out future research directions in section 6.
2. Metric Spaces and Similarity Queries

In this section, we briefly give the definitions for metric distance functions and for different types of similarity queries.

A metric distance function d(x,y) for a metric space is defined as follows:
i) d(x,y) = d(y,x)
ii) 0 < d(x,y) < ∞, x ≠ y
iii) d(x,x) = 0
iv) d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)

The above conditions are the only ones we should assume when designing an index structure based on distances between objects in a metric space. Note that we cannot make use of any geometric information about the metric space, unlike the way we can for a Euclidean space. We only have a set of objects from a metric space, and a distance function d() that can be used to compute the distance between any two objects.

Similarity-based queries can be posed in a number of ways. The most common one asks for all data objects that are within some specified distance from a given query object. These queries require the retrieval of the near neighbors of the query object:

Near Neighbor Query: From a given set of data objects X = {X1, X2, ..., Xn} from a metric space with a metric distance function d(), retrieve all data objects that are within distance r of a given query point Y. The resulting set is { Xi | Xi ∈ X and d(Xi, Y) ≤ r }. Here, r is generally referred to as the similarity measure, or the tolerance factor.

Some variations of the near neighbor query are also possible. The nearest neighbor query asks for the closest object to a given query object. Similarly, the k closest objects may be requested as well. Though not very common, objects that are farther than a given range from a query object can also be asked for, as can the farthest, or the k farthest, objects from the query object. The formulation of all these queries is similar to the near neighbor query given above.

Here, we are mainly concerned with distance-based indexing for high-dimensional metric spaces. We also concentrate on near neighbor queries when we introduce our index structure. Our main objective is to minimize the number of distance calculations for a given similarity query, as we assume that distance computations in high-dimensional metric spaces are very expensive. In the next section, we discuss the indexing problem for high-dimensional metric spaces, and review previous approaches to the problem.
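The near neighbor query can always be answered by a linear scan that computes one distance per data object; the index structures discussed in the rest of the paper aim to answer the same query with far fewer distance computations. The following minimal Python sketch of that baseline is only illustrative (the paper's own implementation is in C, and the names range_query and euclidean are ours):

import random

def euclidean(x, y):
    # Euclidean (L2) distance between two equal-length vectors
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def range_query(data, dist, q, r):
    # Linear-scan baseline: one distance computation per data object
    return [x for x in data if dist(x, q) <= r]

# Example: 20-dimensional vectors from the unit hypercube, as in section 5.1
data = [[random.random() for _ in range(20)] for _ in range(1000)]
q = [random.random() for _ in range(20)]
print(len(range_query(data, euclidean, q, r=1.5)))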
3. Indexing in High-Dimensional Spaces

For low-dimensional Euclidean domains, the conventional index structures ([Sam89]) such as R-trees (and their variations) [Gut84, SRF87, BKSS90] can be used effectively to answer similarity queries. In such cases, a near neighbor search query asks for all the objects in (or intersecting) a spherical search window whose center is the query object and whose radius is the tolerance factor r. There are also special techniques for other forms of similarity queries, such as nearest neighbor queries. For example, in [RKV95], some heuristics are introduced to efficiently search the R-tree structure to answer nearest neighbor queries. However, the conventional spatial structures stop being efficient if the dimensionality is high. Experimental results [Ott92] show that R-trees become inefficient for n-dimensional spaces where n is greater than 20.

The problem of indexing high-dimensional spaces can be approached in different ways. One approach is to use distance-preserving transformations to Euclidean spaces, which we discuss in section 3.1. Another approach is to use distance-based index structures. In section 3.2, we discuss distance-based index structures and briefly review the previous work. In section 3.3, we discuss the vp-tree structure in detail, since it is the approach most relevant to our work.

3.1 Distance Preserving Transformations

There are ways to use conventional spatial structures for high-dimensional domains. One way is to apply a mapping of objects from a high-dimensional space to a low-dimensional (Euclidean) space by using a distance-preserving transformation, and then to use conventional index structures (such as R-trees) as a major filtering mechanism. A distance-preserving transformation is a mapping from a high-dimensional domain to a lower-dimensional domain where the distances between objects before the transformation (in the actual space) are greater than or equal to the distances after the transformation (in the transformed space). That is, distance-preserving functions underestimate the actual distances between objects in the transformed space. Distance-preserving transformations have been successfully used to index high-dimensional data in many applications, such as time sequences [AFA93, FRM94] and images [FEF+94].
Distance-preserving functions such as DFT or Karhunen-Loeve are applicable to any Euclidean domain. Yet it is also possible to come up with application-specific distance-preserving transformations for the same purpose. In the QBIC (Query By Image Content) system [FEF+94], the color content of images can be used to answer similarity queries. The difference between the color contents of two images is computed from their color histograms. Computing a distance between the color histograms of two images is quite expensive, as the color histograms are high-dimensional vectors (the number of different colors is generally 64 or 256), and cross-talk between colors (as some colors are similar) also has to be considered. To speed up color distance computation, QBIC keeps an index on the average color of images. The average color of an image is a 3-dimensional vector with the average red, blue, and green values of the pixels in the image. The distance between the average color vectors of two images is proven to be less than or equal to the distance between their color histograms; that is, the transformation is distance preserving. Similarity queries on the color content of images are answered by first using the index on the average color vectors as the major filtering step, and then refining the result by actual computations of histogram distances.

Note that, although the idea of using a distance-preserving transformation works fine for many applications, it makes the assumption that such a transformation exists and is applicable to the application domain. Transformations such as DFT or Karhunen-Loeve are not effective in indexing high-dimensional vectors where the values at each dimension are uncorrelated for any given vector. Therefore, unfortunately, it is not always possible or cost effective to employ this method. Yet there are distance-based indexing techniques that are applicable to all domains where metric distance functions are employed. These techniques can be directly employed for high-dimensional spatial domains, as the conventional distance functions (such as Euclidean, or any Lp distance) used for these domains are metric. Sequence matching, time-series analysis, and image databases are some example applications with such domains. Distance-based techniques are also applicable to domains where the data is non-spatial (that is, where data objects cannot be mapped to points in a multi-dimensional space), such as text databases, which generally use the edit distance (which is metric) for computing the similarity of data items (lines of text, words, etc.). We review a few of the distance-based indexing techniques below.
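The filter-and-refine pattern behind QBIC's average-color index can be sketched in a few lines. The transform below (averaging consecutive blocks of coordinates and rescaling) is not QBIC's; it is simply one transformation that provably underestimates Euclidean distance, and all names are illustrative. Objects whose transformed distance already exceeds r are discarded without ever paying for the full distance computation:

import math

def block_average(v, block):
    # Map an n-dimensional vector to roughly n/block dimensions by averaging consecutive blocks
    return [sum(v[i:i + block]) / block for i in range(0, len(v), block)]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def filter_and_refine(data, q, r, block=16):
    # sqrt(block) * ||T(x) - T(q)|| <= ||x - q||, so the scaled transformed distance
    # is a lower bound and can never discard a true answer.
    tq = block_average(q, block)
    scale = math.sqrt(block)
    result = []
    for x in data:
        if scale * euclidean(block_average(x, block), tq) > r:
            continue                          # filter step: cheap lower bound already exceeds r
        if euclidean(x, q) <= r:              # refine step: exact distance for survivors only
            result.append(x)
    return result

In practice the transformed vectors would be computed once and indexed with a conventional spatial structure such as an R-tree, exactly as described above.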
The aim is to transformation works fine for many applications, it makes the minimize the number of distance computations as much as assumption that such a transformation exists and applicable to possible, as they are assumed to be very expensive. Search the application domain. Transformations such as DFT or afgoritbms of O(n) or even O(n log n) (where n is the number of Karhunen-Loeve are not effective in indexing high-dimensional data objects) are acceptable if they minimize the number distance vectors where the values at each dimension are uncorrelated for computations. In [SW90], a table of size 0(n2) keeps the any given vector. Therefore, unfortunately, it is not always distances between data objects if they are pre-computed. The possible or cost effective to employ this method. Yet, there are other pairwise distances are estimated (by specifying an interval) distance based indexing techniques that are applicable to all by making use of the other pre-computed distances. The domains where metric distance functions are employed. These technique of storing and using pre-computed distances may be techniques can be directly employed for high-dimensional spatial effective for data domains with small cardinality, however, the 359 space requirements and the search complexity becomes (from S“) indexed below that node, and RD,,and L+,, all the points overwhelming for larger domti,ns. are pokters to the right and left branches. Left branch of the node indexes the points whose distances from S, are less than or In [Uh191 ], Uhlmann introduced two hierarchical index structures for similarity search. The first one is the vp-tree equal to M, and right branch of the node indexes the points (vantage-point tree). The vp-tree basically partitions the data whose distances from S, are greater than or equal to M. In leaf space into spherical cuts around a chosen vantage point at each nodes, instead of the pointers to the left and right branches, level. This approach, referred to as the ball decomposition in the references to the data points are kept. paper is similar to the first method presented in [BK73]. At each Given a finite set S={S1, S2, .. . S“) of n objects, and a node, the distances between the vantage point for that node and metric distance function d(S the data points to be indexed below that node are computed. The median is found, and the data points are partitioned into two groups, one of them accommodating the points whose distances to the vantage point are less than or equal to the median distance, and the other group accommodating the points whose distances are larger than or equal to the median. These two groups of data points are indexed separately by the left and right subbranches below that node, which are constructed in the same way recursively. Although the vp-tree was introduced as a binary tree, it is also possible to generalize it to a multi-way tree for larger fanouts. In [Yia93], the vp-tree structure was enhanced by an algorithm to pick vantage-points for better decompositions. In [Chi94] the vp-tree structure is modified to answer nearest neighbor queries. We talk about the vp-trees in detail in section 3.3. The gh-tree (generalized hyperplane tree) structure was also introduced in [Uh191 ]. It is constructed as follows. At the top node, two points are picked and the remaining points are divided (Q, S,) <r, then S, (the vantage point at the root) into two groups depending on which of these two points they are is in the answer set. closer to. 
The two branches for the two groups are built 2) If d(Q, S,) + r 2 M (median), then recursively search recursively in the same way. Unlike the vp-trees, the branching the right branch factor can only be two. If the two pivot points are well-selected at 3) If d(Q, S,) - r < M, then recursively search the left every level, the gh-tree tends to be a well-balanced structure. branch. (note that both branches can be searched if both search More recently, Bnn introduced the GNAT (Geometric conditions are satisfied) Near-Neighbor Access Tree) structure [Bri95]. A k number of split points are chosen at the top level. Each one of the remaining The correctness of this simple search strategy can be points are associated with one of the k datasets (one for each split proven easily by using the triangle inequality of distances among point), depending on which split point they are closest to. For any three objects in a metric data space (see Appendix). each split point, the minimum and maximum distances from the points in the datasets of other split points are recorded. The tree Generalizing binary vp-treesinto multi-way vp-trees. is recursively built for each dataset at the next level. The The binary vp-tree can be easily generalized into a muki- number of split points, k, is parametrized and is chosen to be a way tree structurefor larger frmouts at every node hoping that the different value for each data set depending on its cardinality. decrease in the height of the tree would also decrease the number The GNAT structure is compared to the binary vp-tree, and it is of distance computations. The construction of a vp-tree of order shown that the preprocessing (construction) step of GNAT is m is very similar to that of a binary vp-tree. Here, instead of more expensive than the vp-tree, but its search algorithm makes finding the median of the distances between the vantage point less number of distance computations in the experiments for and the data points, the points are ordered with respect to their different data sets. distances from the vantage point, and partitioned into m groups of equal cardinality. The distance values used to partition the 3.3 Vantage point tree structure data points are recorded in each node. We will refer to those values as cutofl values. There are m-1 cutoff values in a node. Let us briefly discuss the vp-trees to explain the idea of The m groups of data points are indexed below the root node by partitioning the data space around selected points (vantage its m children, which are themselves vp-trees of order m created points) at different levels forming a hierarchical tree structure in the same way recursively. The construction of an m-way vp- and using it for effective filtering in similarity search queries. tree requires O(n log~ n) distance computations. That is, creating an m-way vp-tree decreases the number of distance computations The structure of a binary vp-tree is very simple. Each by a factor of logz m compared to binary vp-trees at the internal node is of the form (S,, M, RPt,, ~t,), where S, is the construction stage. vantage point, M is the median distance among the distances of 360 leaf nodes for effective filtering of non qualifying points in a similarity search operation. 4.1 Motivation Before we introduce the mvp-tree, we first discuss a few useful observations that can be used as heuristics for a better search structure. The idea is to partition the data space around a vantage point at each level for a hierarchical search. 
Generalizing binary vp-trees into multi-way vp-trees. The binary vp-tree can easily be generalized into a multi-way tree structure for larger fanouts at every node, in the hope that the decrease in the height of the tree will also decrease the number of distance computations. The construction of a vp-tree of order m is very similar to that of a binary vp-tree. Here, instead of finding the median of the distances between the vantage point and the data points, the points are ordered with respect to their distances from the vantage point and partitioned into m groups of equal cardinality. The distance values used to partition the data points are recorded in each node; we will refer to those values as cutoff values. There are m-1 cutoff values in a node. The m groups of data points are indexed below the root node by its m children, which are themselves vp-trees of order m created in the same way recursively. The construction of an m-way vp-tree requires O(n log_m n) distance computations. That is, creating an m-way vp-tree decreases the number of distance computations by a factor of log2 m compared to binary vp-trees at the construction stage.

4. Multi-vantage-point trees

In this section, we present the mvp-tree (multi-vantage-point tree). Similar to the vp-tree, the mvp-tree partitions the data space into spherical cuts around vantage points. However, it creates the partitions with respect to more than one vantage point at one level, and it keeps extra information for the data points in the leaf nodes for effective filtering of non-qualifying points in a similarity search operation.

4.1 Motivation

Before we introduce the mvp-tree, we first discuss a few useful observations that can be used as heuristics for a better search structure. The idea is to partition the data space around a vantage point at each level for a hierarchical search.

Figure 1. The root-level partitioning of a vp-tree with branching factor 3. The three different regions are labelled 1, 2, 3, and they are all shaded differently.

However, there is one problem with high-order vp-trees when the order is large. The vp-tree partitions the data space into spherical cuts (see Figure 1). Those spherical cuts become too thin for high-dimensional domains, leading the search regions to intersect many of them, and therefore leading to more branching during similarity searches. As an example, consider an N-dimensional Euclidean space where N is a large number, and a vp-tree of order 3 built to index uniformly distributed data points in that space. At the root level, the N-dimensional space is partitioned into three spherical regions, as shown in Figure 1, labelled 1, 2, and 3. Let R1 be the radius of region 1, and R2 be the radius of the sphere enclosing regions 1 and 2. Because of the uniform distribution assumption, we can consider the N-dimensional volumes of regions 1 and 2 to be equal. The volume of an N-dimensional sphere is directly proportional to the Nth power of its radius, so we can deduce that R2 = R1 * 2^(1/N). The thickness of the spherical shell of region 2 is R2 - R1 = R1 * (2^(1/N) - 1). To give an idea, for N = 100, R2 = 1.007 R1.

So, when the spherical cuts are very thin, the chance of a search operation descending down to more than one branch becomes higher. If a search path descends down to k out of the m children of a node, then k distance computations are needed at the next level, where the distance between the query point and the vantage point of each child node has to be found. This is because the vp-tree keeps a different vantage point for each node at the same level. Each child of a node is associated with a region that is like a spherical shell (other than the innermost child, which has a spherical region), and the data points indexed below that child node all belong to that region. Those regions are disjoint for the siblings. As the vantage point for a node has to be chosen among the data points indexed below that node, the vantage points of the siblings are all different.
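The thinness of these shells is easy to check numerically: the snippet below just evaluates R2/R1 = 2^(1/N) for a few dimensionalities, reproducing the 1.007 figure quoted above for N = 100 (an illustrative calculation, not part of the original paper):

for n in (2, 10, 20, 100):
    ratio = 2 ** (1 / n)   # R2 / R1 when regions 1 and 2 have equal N-dimensional volume
    print(f"N={n:3d}  R2/R1={ratio:.4f}  shell thickness={ratio - 1:.4f} * R1")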
Observation 1: It is possible to partition a spherical shell-like region using a vantage point chosen from outside the region. This is shown in Figure 2, where a vantage point outside of the region is used to partition it into three parts, which are labelled 1, 2, 3 and shaded differently (region 2 consists of two disjoint parts). The vantage point does not have to be from inside the region, unlike the strategy followed in vp-trees.

Figure 2. Partitioning a spherical shell-like region using a vantage point from outside.

This means that we can use the same vantage point to partition the regions associated with the nodes at the same level. When the search operation descends down to several branches, we then do not have to make a different distance computation at the root of each branch. Also, if we can use the same vantage point for all the children of a node, we can as well keep that vantage point in the parent. This way, we would be keeping more than one vantage point in the parent node. We can avoid creating the children nodes altogether by incorporating them into the parent, which can be done by increasing the fanout of the parent node. The mvp-tree takes this approach, and uses more than one vantage point in each node for higher utilization.

Observation 2: In the construction of the vp-tree structure, for each data point in the leaves, we compute the distances between that point and all the vantage points on the path from the root node to the leaf node that keeps that data point. So for each data point, (log_m n) distance computations (for a vp-tree of order m) are made, which is equal to the height of the tree. In vp-trees, such distances (other than the distance to the vantage point of the leaf node) are not kept. However, it is possible to keep these distances for the data points in the leaf nodes to provide further filtering at the leaf level during search operations. We use this idea in mvp-trees. In mvp-trees, for each data point in a leaf, we also keep the first p distances (here, p is a parameter) that are computed in the construction step between that data point and the vantage points at the upper levels of the tree. The search algorithm is modified to make use of these distances.

Having shown the motivation behind the mvp-tree structure, we explain the construction and search algorithms below.

4.2 mvp-tree structure

The mvp-tree uses two vantage points in every node. Each node of the mvp-tree can be viewed as two levels of a vantage-point tree (a parent node and all its children) where all the children nodes at the lower level use the same vantage point. This makes it possible for an mvp-tree node to have large fanouts, and for the non-leaf levels to have fewer vantage points.

In this section, we show the structure of mvp-trees and present the construction algorithm for binary mvp-trees. In general, an mvp-tree has 3 parameters:

● the number of partitions created by each vantage point (m),
● the maximum fanout for the leaf nodes (k),
● and the number of distances kept for the data points at the leaves (p).

In binary mvp-trees, the first vantage point (we will refer to it by Sv1) divides the space into two parts, and the second vantage point (we will refer to it by Sv2) divides each of these partitions into two. So the fanout of a node in a binary mvp-tree is four. In general, the fanout of an internal node is denoted by the parameter m^2, where m is the number of partitions created by a vantage point. The first vantage point creates m partitions, and the second point creates m partitions from each of these partitions created by the first vantage point, making the fanout of the node m^2.

In every internal node, we keep the median, M1, for the partition with respect to the first vantage point, and the medians, M2[1] and M2[2], for the further partitions with respect to the second vantage point.
In the leaf nodes, we keep the exact distances between the data points in the leaf and the vantage points of that leaf. D1[i] and D2[i] (i = 1, 2, ..., k) are the distances from the first and second vantage points respectively, where k is the fanout of the leaf nodes, which may be chosen larger than the fanout m^2 of the internal nodes.

For each data point x in the leaves, the array x.PATH[p] keeps the pre-computed distances between the data point x and the first p vantage points along the path from the root to the leaf node that keeps x. The parameter p cannot be larger than the maximum number of vantage points along a path from the root to any leaf node. Figure 3 shows the structure of the internal and leaf nodes of a binary mvp-tree.

Figure 3. Node structure for a binary mvp-tree. An internal node keeps the two vantage points Sv1 and Sv2, the cutoff values M1, M2[1], M2[2], and the child pointers. A leaf node keeps Sv1 and Sv2, the data points P1 through Pk, their exact distances D1[1..k] and D2[1..k] from the two vantage points, and a PATH array for each data point.

Having given the explanation for the parameters and the structure, we present the construction algorithm next. Note that we took m = 2 for simplicity in presenting the algorithm.

Construction of mvp-trees

Given a finite set S = {S1, S2, ..., Sn} of n objects, and a metric distance function d(Si, Sj), an mvp-tree with parameters m = 2, k, and p is constructed on S as follows. (Here, we use the notation explained above. The variable level is used to keep track of the number of vantage points used along the path from the root to the current node; it is initialized to 1.)

1) If |S| = 0, then create an empty tree and quit.

2) If |S| ≤ k+2, then
  2.1) Select an arbitrary object from S. (Sv1 is the first vantage point.)
  2.2) Let S := S - {Sv1} (delete Sv1 from S).
  2.3) Calculate all d(Si, Sv1) where Si ∈ S, and store them in array D1.
  2.4) Let Sv2 be the farthest point from Sv1 in S. (Sv2 is the second vantage point.)
  2.5) Let S := S - {Sv2} (delete Sv2 from S).
  2.6) Calculate all d(Sj, Sv2) where Sj ∈ S, and store them in array D2.
  2.7) Quit.

3) Else, if |S| > k+2, then
  3.1) Let Sv1 be an arbitrary object from S. (Sv1 is the first vantage point.)
  3.2) Let S := S - {Sv1} (delete Sv1 from S).
  3.3) Calculate all d(Si, Sv1) where Si ∈ S;
       if (level ≤ p), Si.PATH[level] = d(Si, Sv1).
  3.4) Order the objects in S with respect to their distances from Sv1. M1 = median of {d(Si, Sv1) | ∀Si ∈ S}. Break this list into 2 lists of equal cardinality at the median. Let SS1 and SS2 be these two sets in order, i.e., SS2 keeps the objects farthest from Sv1.
  3.5) Let Sv2 be an arbitrary object from SS2. (Sv2 is the second vantage point.)
  3.6) Let SS2 := SS2 - {Sv2} (delete Sv2 from SS2).
  3.7) Calculate all d(Sj, Sv2) where Sj ∈ SS1 or Sj ∈ SS2;
       if (level < p), Sj.PATH[level+1] = d(Sj, Sv2).
  3.8) M2[1] = median of {d(Sj, Sv2) | ∀Sj ∈ SS1};
       M2[2] = median of {d(Sj, Sv2) | ∀Sj ∈ SS2}.
  3.9) Break the list SS1 into two sets of equal cardinality at M2[1], and similarly break SS2 into two sets of equal cardinality at M2[2]. Let level = level + 2, and recursively create the mvp-trees on these four sets.
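To make the m = 2 construction concrete, here is a compact Python sketch of it. It is only an illustration under the same assumptions as the algorithm above (arbitrary first vantage point, second vantage point taken from the far half, PATH arrays truncated to p entries); names such as Internal, Leaf, and build_mvp_tree are ours, the authors' implementation is in C, and tie handling at the medians is simplified, so the four children need not have exactly equal cardinality:

import statistics

class Internal:
    def __init__(self, sv1, m1, sv2, m2, children):
        self.sv1, self.m1 = sv1, m1            # first vantage point and its median cutoff
        self.sv2, self.m2 = sv2, m2            # second vantage point and its two median cutoffs
        self.children = children               # four subtrees: near/near, near/far, far/near, far/far

class Leaf:
    def __init__(self, sv1, sv2, points, d1, d2, paths):
        self.sv1, self.sv2 = sv1, sv2
        self.points, self.d1, self.d2 = points, d1, d2   # exact distances to the leaf's vantage points
        self.paths = paths                     # pre-computed distances to ancestor vantage points

def _build(items, dist, k, p, level):
    # items is a list of (point, path) pairs; path accumulates distances to ancestor vantage points
    if not items:
        return None
    (sv1, _), rest = items[0], items[1:]
    d1 = [dist(x, sv1) for x, _ in rest]
    if len(items) <= k + 2:                    # small set: build a leaf, sv2 = farthest point from sv1
        sv2, d2 = None, []
        if rest:
            far = max(range(len(rest)), key=d1.__getitem__)
            sv2 = rest.pop(far)[0]
            d1.pop(far)
            d2 = [dist(x, sv2) for x, _ in rest]
        return Leaf(sv1, sv2, [x for x, _ in rest], d1, d2, [path[:p] for _, path in rest])
    if level <= p:                             # record the PATH entry for the first vantage point
        for (x, path), d in zip(rest, d1):
            path.append(d)
    m1 = statistics.median(d1)
    ss1 = [it for it, d in zip(rest, d1) if d <= m1]
    ss2 = [it for it, d in zip(rest, d1) if d > m1]
    sv2 = (ss2 or ss1).pop(0)[0]               # second vantage point, taken from the far half
    children, m2 = [], []
    for ss in (ss1, ss2):
        d2 = [dist(x, sv2) for x, _ in ss]
        if level + 1 <= p:                     # record the PATH entry for the second vantage point
            for (x, path), d in zip(ss, d2):
                path.append(d)
        med = statistics.median(d2) if d2 else 0.0
        m2.append(med)
        children.append(_build([it for it, d in zip(ss, d2) if d <= med], dist, k, p, level + 2))
        children.append(_build([it for it, d in zip(ss, d2) if d > med], dist, k, p, level + 2))
    return Internal(sv1, m1, sv2, m2, children)

def build_mvp_tree(points, dist, k=8, p=5):
    return _build([(x, []) for x in points], dist, k, p, level=1)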
The mvp-tree construction can easily be modified so that more than 2 vantage points are kept in one node. Higher fanouts at the internal nodes are also possible, and may be more favorable in most cases.

Observe that we chose the second vantage point to be one of the farthest points from the first vantage point. If the two vantage points were close to each other, they would not be able to partition the dataset effectively. Actually, the farthest point may very well be the best candidate for the second vantage point. That is why we chose the second vantage point in a leaf node to be the farthest point from the first vantage point of that leaf node. Note that any optimization technique (such as a heuristic to choose the best vantage point) for vp-trees can also be applied to mvp-trees.

The construction step requires O(n log_m n) distance computations for the mvp-tree. There is an extra storage requirement for the mvp-trees, as we keep p distances for each data point in the leaf nodes; however, this does not change the order of the storage complexity.

A full mvp-tree with parameters (m, k, p) and height h has 2*(m^(2h) - 1)/(m^2 - 1) vantage points. That is actually twice the number of nodes in the mvp-tree, as we keep two vantage points at every node. The number of data points that are not used as vantage points is m^(2(h-1)) * k, which is the number of leaf nodes times the capacity (k) of the leaf nodes.

It is a good idea to keep k large so that most of the data items are kept in the leaves. If k is kept large, the ratio of the number of vantage points to the number of points in the leaf nodes becomes smaller, meaning that most of the data points are accommodated in the leaf nodes. This makes it possible to filter out many non-qualifying (out of the search region) points from further consideration by making use of the p pre-computed distances for each leaf point. In other words, instead of making many distance computations with the vantage points in the internal nodes, we delay the major filtering step of the search algorithm to the leaf level, where we have more effective means of avoiding unnecessary distance computations.

4.3 Search algorithm for mvp-trees

We present the search algorithm below. Note that the search algorithm proceeds depth-first for mvp-trees. We need to keep the distances between the query object and the first p vantage points along the current search path, as we will use these distances for eliminating data points in the leaves from further consideration (when possible). An array PATH[] of size p is used to keep these distances.

Similarity Search in mvp-trees

For a given query object Q and query range r, the set of data objects that are within distance r of Q is found as follows, starting at the root with l = 1:

1) Compute the distances d(Q, Sv1) and d(Q, Sv2) for the current node. (Sv1 and Sv2 are its first and second vantage points.)
   If d(Q, Sv1) ≤ r, then Sv1 is in the answer set.
   If d(Q, Sv2) ≤ r, then Sv2 is in the answer set.

2) If the current node is a leaf node, then for each data point Si in the node:
  2.1) Find d(Si, Sv1) and d(Si, Sv2) from the arrays D1 and D2 respectively.
  2.2) If [d(Q, Sv1) - r ≤ d(Si, Sv1) ≤ d(Q, Sv1) + r] and [d(Q, Sv2) - r ≤ d(Si, Sv2) ≤ d(Q, Sv2) + r], then
       if (PATH[i] - r ≤ Si.PATH[i] ≤ PATH[i] + r) holds for all i = 1..p,
       then compute d(Q, Si). If d(Q, Si) ≤ r, then Si is in the answer set.

3) Else, if the current node is an internal node:
  3.1) If (l ≤ p), set PATH[l] = d(Q, Sv1); if (l < p), set PATH[l+1] = d(Q, Sv2).
  3.2) If d(Q, Sv1) - r ≤ M1, then
       if d(Q, Sv2) - r ≤ M2[1], recursively search the first branch with l = l+2;
       if d(Q, Sv2) + r ≥ M2[1], recursively search the second branch with l = l+2.
  3.3) If d(Q, Sv1) + r ≥ M1, then
       if d(Q, Sv2) - r ≤ M2[2], recursively search the third branch with l = l+2;
       if d(Q, Sv2) + r ≥ M2[2], recursively search the fourth branch with l = l+2.
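The decisive filtering happens in step 2: a stored distance can rule a point out before any new distance computation is made. A self-contained sketch of that test (illustrative names only; query_to_vps plays the role of the PATH[] array filled in during the descent, extended with d(Q, Sv1) and d(Q, Sv2) of the leaf):

def survives_leaf_filter(stored_dists, query_to_vps, r):
    # stored_dists[i]  = d(Si, Vi), pre-computed at construction time (PATH, D1 and D2 entries)
    # query_to_vps[i]  = d(Q, Vi), computed once during the descent
    # By the triangle inequality, d(Q, Si) <= r is only possible if every stored
    # distance lies in the interval [d(Q, Vi) - r, d(Q, Vi) + r].
    return all(abs(s - q) <= r for s, q in zip(stored_dists, query_to_vps))

# Only the survivors pay for an exact distance computation:
# answers = [x for x in leaf_points
#            if survives_leaf_filter(stored[x], query_path, r) and dist(q, x) <= r]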
The efficiency of the search algorithm depends very much on the distribution of distances among the data points, the query range, and the selection of vantage points. In the worst case, most data points are relatively far away from each other (such as randomly generated vectors in a high-dimensional domain, as in section 5). The search algorithm, in this case, can make O(N) distance computations (where N is the cardinality of the dataset). However, even in the worst case, the number of distance computations made by the search algorithm is far less than N, making it a significant improvement over linear search. Note that this claim on worst-case complexity holds for all distance-based index structures, simply because all of them use the triangle inequality to filter out data points that are distant from the query point.

In the next section, we present some experiments to study the performance of mvp-trees.

5. Implementation

We have implemented a main-memory model of the mvp-trees with different parameters to test them and compare them with vp-trees. The mvp-trees and the vp-trees are both implemented in C under the UNIX operating system. Since the distance computations are very costly for high-dimensional metric spaces, we use the number of distance computations as the cost measure. We counted the number of distance computations required for similarity search queries by both mvp- and vp-trees on the same set of queries for comparison.

5.1 Data Sets

Two types of data, high-dimensional Euclidean vectors and gray-level MRI images (where each image has 256*256 pixels), are used for the empirical study.

A. High-Dimensional Euclidean Vectors:

We used two sets of 50,000 20-dimensional vectors as data sets. The Euclidean distance is used as the distance metric in both cases. For the first set, all vectors are chosen randomly from a 20-dimensional hypercube with each side of size 1. Each of these vectors is simply generated by randomly choosing 20 real numbers from the interval [0,1]. The pairwise distance distribution of these randomly chosen vectors is shown as a histogram in Figure 4. The distance values are sampled at intervals of length 0.01.

Figure 4. Distance distribution for randomly generated Euclidean vectors. (The Y axis shows the number of data object pairs that have the corresponding distance value; the distance values are sampled at intervals of length 0.01.)

Note that this data set is highly synthetic. As the vectors are uniformly distributed, they are mostly far away from each other. Their distance distribution is similar to a sharp Gaussian curve where the distances between any two points fall mostly within the interval [1, 2.5], concentrating around the midpoint 1.75. As a result, the vantage points (in both vp-trees and mvp-trees) always partition the space into thin spherical shells, and there is always a large, void spherical region in the center that does not accommodate any data points. This distribution makes both structures (or any other hierarchical method) ineffective for queries having values of r (the similarity measure) larger than 0.5, although higher r values are quite reasonable for legitimate similarity queries.

The second set of Euclidean vectors is generated in clusters of equal size. The clusters are generated as follows. First, a random vector is generated from the hypercube with each side of size 1. This random vector becomes the seed for the cluster. Then, the other vectors in the cluster are generated from this vector or a previously generated vector in the same cluster, simply by altering each dimension of that vector by the addition of a random value chosen from the interval [-ε, ε], where ε is a small constant (such as between 0.1 and 0.2). Since most of the points are generated from previously generated points, the accumulated differences may become large, and therefore there are many points that are distant from the seed of the cluster (and from each other), and many are outside of the hypercube of side 1. We call these groups of points clusters because of the way they are generated, not because they are a bunch of points that are very close in the Euclidean space. In Figure 5, we see the distance distribution histogram for a set of clustered data where each cluster is of size 1000 and ε is 0.15. Again the distance values are sampled at intervals of size 0.01. One can quickly see that this data set has a different distance distribution, where the possible pairwise distances have a wider range, and the distribution is not as sharp as it was for random vectors. For this data set, we tested similarity queries with r ranging from 0.2 to 1.0.

Figure 5. Distance distribution for Euclidean vectors generated in clusters. (The Y axis shows the number of data object pairs that have the corresponding distance value; the distance values are sampled at intervals of length 0.01.)
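The clustered test vectors can be generated in a few lines following the description above (a seed in the unit hypercube, each new vector perturbing a previously generated one by a uniform offset in [-ε, ε] per dimension). The sketch below is illustrative; parameter names are ours:

import random

def make_cluster(size, dim=20, eps=0.15):
    seed = [random.random() for _ in range(dim)]           # cluster seed inside the unit hypercube
    cluster = [seed]
    while len(cluster) < size:
        parent = random.choice(cluster)                    # perturb a previously generated vector
        cluster.append([c + random.uniform(-eps, eps) for c in parent])
    return cluster

# e.g. 50 clusters of 1000 vectors each, giving the 50,000-vector clustered data set
data = [v for _ in range(50) for v in make_cluster(1000)]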
B. Gray-Level MRI Images:

We also experimented on 1151 MRI images with 256*256 pixels and 256 gray levels. These images are a collection of MRI head scans of several people. Since we do not have any content information on these images, we simply used the L1 and L2 metrics to compute the distances between images. Recall that the Lp distance between any two N-dimensional Euclidean vectors X and Y (denoted by Dp(X,Y)) is calculated as

    Dp(X,Y) = ( Σ (i = 1..N) |xi - yi|^p )^(1/p).

The L2 metric is the Euclidean distance metric. The L1 distance between two vectors is simply found by accumulating the absolute differences at each dimension.

When calculating distances, we simply treat these images as 256*256 = 65536-dimensional Euclidean vectors, and accumulate the pixel-by-pixel intensity differences using the L1 or L2 metric. This data set is a good example where it is very desirable to decrease the number of distance computations by using an index structure. The distance computations not only require a large number of arithmetic operations, but also require considerable I/O time, since the images are kept on secondary storage (in binary PGM format using one byte per pixel, around 64K per image).

We see the distance distributions of the MRI images for the L1 and L2 metrics in the two histograms shown in Figures 6 and 7. There are (1150*1151)/2 = 661,825 different pairs of images, and hence that many distance computations. The L1 distance values are normalized by 10000 to avoid large values in the distance calculations between images; the L2 distance values are similarly normalized by 100. After the normalization, the distance values are sampled at intervals of length 1 in each case.

Figure 6. Distance histogram for the images when the L1 metric is used (distance values divided by 10000).

Figure 7. Distance histogram for the images when the L2 metric is used (distance values divided by 100).

The distance distribution for the images is much different from the one for the Euclidean vectors. There are two peaks, indicating that while most of the images are distant from each other, some of them are quite similar, probably forming several clusters. This distribution also gives us an idea about choosing meaningful tolerance factors for similarity queries, in the sense that we can see what distance ranges can be considered similar. If the L1 metric is used, a tolerance factor (r) around 500,000 is quite meaningful, whereas if the L2 metric is used, the tolerance factor should be around 3000.

It is also possible to use other distance measures. Any Lp metric can be used just like L1 or L2. An Lp metric can also be used in a weighted fashion, where each pixel position is assigned a weight that multiplies the intensity difference of the two images at that pixel position when computing the distance. Such a distance function can easily be shown to be metric. It can be used to give more importance to particular regions (for example, the center of the images) in computing distances.

For gray-level images, color histograms can also be used to compute similarity. Unlike color images, there is no cross-talk (between colors) in gray-level (or any mono-color) images, and therefore an Lp metric can be used to compute distances between the histograms. The histograms are simply treated as if they were 256-dimensional vectors, and then an Lp metric can be applied.
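Treating an image as one long vector of pixel intensities, the L1 and L2 distances used here reduce to a few lines. A sketch (the flattening and the normalization constants follow the description above; function names are ours):

def lp_distance(x, y, p):
    # Lp distance between two equal-length sequences of pixel intensities
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def l1_images(img_a, img_b):
    return lp_distance(img_a, img_b, 1) / 10000.0    # normalization used for the L1 histograms

def l2_images(img_a, img_b):
    return lp_distance(img_a, img_b, 2) / 100.0      # normalization used for the L2 histograms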
5.2 Experimental Results

A. High-Dimensional Euclidean Vectors:

In Figures 8 and 9, we present the search performances of four tree structures for the two data sets of Euclidean vectors. The four structures are the vp-trees of order 2 and 3, and two mvp-trees with the (m,k,p) values (3,9,5) and (3,80,5) respectively. We experimented with vp-trees of higher order; however, higher-order vp-trees gave similar or worse performance, so we do not present their results. We also tried several mvp-trees with different parameters, and observed that order 3 (m) gives the most reasonable results compared to order 2 or any value higher than 3. We kept 5 (p) reference distances for each data point in the leaf nodes of the mvp-trees. The two mvp-trees that we display results for have different k (leaf capacity) values, to see how leaf capacity affects the search efficiency. All the results are obtained by taking the average of 4 different runs for each structure, where a different seed (for the random function used to pick vantage points) is used in each run. The result of each run is obtained by averaging the results of 100 search queries with query objects selected randomly from the 20-dimensional hypercube with each side of size 1. In Figures 8 and 9, the mvp-tree with (m,k,p) values (3,9,5) is referred to as mvpt(3,9) and the other mvp-tree as mvpt(3,80), since both trees have the same p value. The vp-trees of order 2 and 3 are referred to as vpt(2) and vpt(3) respectively.

As shown in Figure 8, both mvp-trees perform much better than the vp-trees, and vpt(2) is slightly better (around 10%) than vpt(3). mvpt(3,9) makes around 40% fewer distance computations than vpt(2). The gap closes slowly as the query range increases, with mvpt(3,9) making 20% fewer distance computations for the query range of 0.5. mvpt(3,80) performs much better, and needs around 80% to 65% fewer distance calculations than vpt(2) for small ranges (0.15 to 0.3). For query ranges of 0.4 and 0.5, mvpt(3,80) makes 45% and 30% (respectively) fewer distance computations than vpt(2). For higher query ranges, the gain in efficiency decreases, which is due to the fact that the data points in the domain are themselves quite distant from each other, making it harder to filter out non-qualifying points during the search operations.

Figure 8. Search performances (number of distance calculations per search) of vp- and mvp-trees for randomly generated Euclidean vectors.

Figure 9 shows the performance results for the data set where the vectors are generated in clusters. For this data set, vpt(3) performs slightly better than vpt(2) (around 10%). The mvp-trees again perform much better than the vp-trees. mvpt(3,80) makes around 70% - 80% fewer distance computations than vpt(3) for small query ranges (up to 0.4), while mvpt(3,9) makes around 45% - 50% fewer computations for the same query ranges. For higher query ranges, the gain in efficiency decreases slowly as the query range increases. For the query range 1.0, mvpt(3,80) requires 25% fewer distance computations than vpt(3), and mvpt(3,9) requires 20% fewer. We have also run experiments on the same type of data with different cluster sizes; the percentages did not differ much.

Figure 9. Search performances (number of distance calculations per search) of vp- and mvp-trees for Euclidean vectors generated in clusters.
An ~ metric can 365 ● Higher order vp-trees perform better for wider distance #diSancs calculations par search for distributions, however the difference is not much. For datasets random vectors with narrow distance distributions, low-order vp-trees are better, I 30000 ● mvp-trees perform much better than vp-trees, The idea I+wlt(2) ? of increasing leaf capacity pays off since it decreases the number L---J +@(3) of vantage points by shortening the height of the tree, and delay + m@(3,9) the major filtering step to the leaf level +m@(3,80) ● For both random and clustered vectors, mvp-trees with high leaf-node capacity perform a considerable improvement over vp-trees, especially for small query ranges (up to 80%). The al efficiency gain (in terms of number of distance computations Leo s made) is smaller for larger query ranges, but still significant (30% for the largest ranges we have tried). u *50C0 B. Gray-Level MRI Images: We display the experimental results for the similarity -0.15 0.2 0.3 0.4 0.5 search performances of vp and mvp trees on MRI images in Query Range Figures 10 and 11. For this domain, we present the results for two vp-trees and three mvp-trees. The vp-trees are of order 2 and Figure 8. Search performances of vp and mvp 3, referred as vpt(2) and vpt(3). All the mvp-trees have the same trees for randomly generated Euclidean vectors. p parameter which is 4. The three mvp-trees are; mvpt(2,16), mvpt(2,5) and mvpt(3, 13) where for each of them, the first parameter is the order (m) and the second one is the leaf capacity #didsncacalculations per march for (k). We did not try for higher m, or k values as the number of vactora generated in clusters data items in our domain is small (1151). Actually, 4 is the maximum p value common to all three mvp-tree structures because of the low cardinality of the data domain. The results are mz averages taken after different runs for different seeds and for 30 different query objects in each run, where each query object is an MRI image selected randomly from the data set. atsncs calculations par search for L1 metric +’”@Q) El +vpt@) +IT?”@l?,h) +n?’@Q.5) + rr@(3, 13) Figure 9. Search performances of VP and mvp tre~s for Euclide~ vectors generat~ in clusters. Figure 9 shows the performance results for the data set where the vectors are generated in clusters. For this data set, o~ vpt(3) performs slightIy better than vpt(2) (around 10%). The 30 40 80 no mvp-trees perform again much better than vp-trees. mvpt(3,80) Q&y R& makes around 70% - 80% less number of distance computations than vpt(3) for small query ranges (up to 0.4), where the Figure 10. Similarity search performances of mvpt(3,9) makes around 45% - 50% less number of {p and mvp trees on MRI images when L I metric computations for the same query ranges. For higher query ranges, is used for distance computations. the gain in efficiency decreases slowly as the query range increases. For the query range 1.0, mvpt(3,80) requires 25% less Figure 10 shows the search performance of these 5 distance computations compared to vpt(3) and mvpt(3,9) requires structures when Ll metric is used. Between the vp-trees, vpt(2) 20% less. We have also run experiments on the same type of data performs around 10-20% percent better than vpt(3). mvpt(2, 16) with different cluster sizes, however the percentages did not and mvpt(2,5) perform very close to each other, both having differ much. around 10% edge over vpt(2). 
Figure 11 shows the search performances when the L2 metric is used. Similar to the case when the L1 metric was used, vpt(2) outperforms vpt(3) with a similar approximate 10% margin. mvpt(2,16) performs better than vpt(2), but its performance degrades for higher query range values. This should not be taken as a general result, because the random function that is used to pick vantage points has a considerable effect on the efficiency of these structures. Similar to the previous case, mvpt(3,13) gives the best performance among all the structures, once again making 20-30% fewer distance computations compared to vpt(2).

Figure 11. Similarity search performances (number of distance calculations per search) of vp- and mvp-trees on MRI images when the L2 metric is used for distance computations.

In summary, the experimental results for the dataset of gray-level images support our previous observations about the efficiency of mvp-trees with high leaf-node capacity. Even though our image dataset has a very low cardinality (leading to shallow tree structures), we were able to obtain around a 20-30% gain in efficiency. If the experiments were conducted on a larger set of images, we would expect higher performance gains.

6. Conclusions

In this paper, we introduced the mvp-tree, a distance-based index structure that can be used in any metric data domain. Like the other distance-based index structures, the mvp-tree does not make any assumption about the geometry of the application space, and it provides a filtering method for similarity search queries based only on relative distances between the data objects. Similar to an existing structure, the vp-tree, the mvp-tree takes the approach of partitioning the data space around vantage points, but it behaves much more cleverly in choosing these points and makes use of the distances pre-computed at the construction stage when answering similarity search queries.

The mvp-tree, like other distance-based index structures, is a static index structure. It is constructed in a top-down fashion on a static set of data points, and it is guaranteed to be a balanced structure. Handling update operations (insertion and deletion) without major restructuring, and without violating the balanced structure of the tree, is an open problem. In general, the difficulty for distance-based index structures stems from the fact that it is not possible, or not cost efficient, to impose a global total order or a grouping mechanism on the objects of the application data domain. We plan to look further into the problem of extending mvp-trees with insertion and deletion operations that would not imbalance the structure.

It would also be interesting to determine the best vantage point for a given set of data objects. Methods to determine better vantage points at a little extra cost would pay off in search queries by causing fewer distance computations to be done. We also plan to look further into this problem.

References

[AFA93] R. Agrawal, C. Faloutsos, A. Swami. "Efficient Similarity Search in Sequence Databases". In FODO Conference, 1993.

[BK73] W.A. Burkhard, R.M. Keller, "Some Approaches to Best-Match File Searching", Communications of the ACM, 16(4), pages 230-236, April 1973.

[BKSS90] N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proceedings of the 1990 ACM SIGMOD Conference, Atlantic City, pages 322-331, May 1990.

[Bri95] S. Brin, "Near Neighbor Search in Large Metric Spaces", Proceedings of the 21st VLDB Conference, pages 574-584, 1995.

[Chi94] T. Chiueh, "Content-Based Image Indexing", Proceedings of the 20th VLDB Conference, pages 582-593, 1994.

[FEF+94] C. Faloutsos, W. Equitz, M. Flickner et al., "Efficient and Effective Querying by Image Content", Journal of Intelligent Information Systems, 3, pages 231-262, 1994.

[FRM94] C. Faloutsos, M. Ranganathan, Y. Manolopoulos. "Fast Subsequence Matching in Time-Series Databases". Proceedings of the 1994 ACM SIGMOD Conference, Minneapolis, pages 419-429, May 1994.

[Gut84] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching", Proceedings of the 1984 ACM SIGMOD Conference, Boston, pages 47-57, June 1984.

[Ott92] M. Otterman, "Approximate Matching with High Dimensionality R-trees". M.Sc. scholarly paper, Dept. of Computer Science, Univ. of Maryland, College Park, MD, 1992. Supervised by C. Faloutsos.

[RKV95] N. Roussopoulos, S. Kelley, F. Vincent, "Nearest Neighbor Queries", Proceedings of the 1995 ACM SIGMOD Conference, San Jose, pages 71-79, May 1995.

[Sam89] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison Wesley, 1989.

[SRF87] T. Sellis, N. Roussopoulos, C. Faloutsos, "The R+-tree: A Dynamic Index for Multi-dimensional Objects", Proceedings of the 13th VLDB Conference, pages 507-518, September 1987.

[SW90] D. Shasha, T. Wang, "New Techniques for Best-Match Retrieval", ACM Transactions on Information Systems, 8(2), pages 140-158, 1990.
References

[AFA93] R. Agrawal, C. Faloutsos, A. Swami, "Efficient Similarity Search in Sequence Databases", in FODO Conference, 1993.
[BK73] W.A. Burkhard, R.M. Keller, "Some Approaches to Best-Match File Searching", Communications of the ACM, 16(4), pages 230-236, April 1973.
[BKSS90] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proceedings of the 1990 ACM SIGMOD Conference, Atlantic City, pages 322-331, May 1990.
[Bri95] S. Brin, "Near Neighbor Search in Large Metric Spaces", Proceedings of the 21st VLDB Conference, pages 574-584, 1995.
[Chi94] T. Chiueh, "Content-Based Image Indexing", Proceedings of the 20th VLDB Conference, pages 582-593, 1994.
[FEF+94] C. Faloutsos, W. Equitz, M. Flickner et al., "Efficient and Effective Querying by Image Content", Journal of Intelligent Information Systems, 3, pages 231-262, 1994.
[FRM94] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases", Proceedings of the 1994 ACM SIGMOD Conference, Minneapolis, pages 419-429, May 1994.
[Gut84] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching", Proceedings of the 1984 ACM SIGMOD Conference, Boston, pages 47-57, June 1984.
[Ott92] M. Otterman, "Approximate Matching with High Dimensionality R-trees", M.Sc. scholarly paper, Dept. of Computer Science, Univ. of Maryland, College Park, MD, 1992. Supervised by C. Faloutsos.
[RKV95] N. Roussopoulos, S. Kelley, F. Vincent, "Nearest Neighbor Queries", Proceedings of the 1995 ACM SIGMOD Conference, San Jose, pages 71-79, May 1995.
[Sam89] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison-Wesley, 1989.
[SRF87] T. Sellis, N. Roussopoulos, C. Faloutsos, "The R+-tree: A Dynamic Index for Multi-dimensional Objects", Proceedings of the 13th VLDB Conference, pages 507-518, September 1987.
[SW90] D. Shasha, T. Wang, "New Techniques for Best-Match Retrieval", ACM Transactions on Information Systems, 8(2), pages 140-158, 1990.
[Uhl91] J.K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees", Information Processing Letters, vol. 40, pages 175-179, 1991.
[Yia93] P.N. Yianilos, "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces", Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 311-321, 1993.

Appendix

Let us show the correctness of the search algorithm for vp-trees. Let Q be the query object, r be the query range, Sv be the vantage point of a node that we visit during the search, and M be the median distance value for the same node. We have to show that:

(I)  if d(Q, Sv) + r < M, then we do not have to search the right branch;
(II) if d(Q, Sv) - r > M, then we do not have to search the left branch.

For (I), let X denote any data object indexed in the right branch, i.e.,

    d(X, Sv) >= M                          (1)
    M > d(Q, Sv) + r                       (2)  (hypothesis)
    d(Q, Sv) + d(Q, X) >= d(X, Sv)         (3)  (triangle inequality)
    d(Q, X) > r                            (4)  (combining (1), (2) and (3))

Because of (4), X cannot be in the query result, which means that we do not have to check any object in the right branch.

For (II), let Y denote any data object indexed in the left branch, i.e.,

    M >= d(Y, Sv)                          (5)
    d(Q, Sv) - r > M                       (6)  (hypothesis)
    d(Y, Sv) + d(Q, Y) >= d(Q, Sv)         (7)  (triangle inequality)
    d(Q, Y) > r                            (8)  (combining (5), (6) and (7))

Because of (8), Y cannot be in the query result, which means that we do not have to check any object in the left branch.
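To connect the two pruning conditions above to the search procedure they justify, here is a minimal sketch of a range query over a binary vp-tree. The node layout (a vantage point, the median distance M, and left/right subtrees) follows the description above, but the class and function names are illustrative, and leaf buckets are omitted for brevity.

    # Sketch of a binary vp-tree range search using the pruning rules
    # proved above: skip the right branch when d(Q, Sv) + r < M, and
    # skip the left branch when d(Q, Sv) - r > M.
    class VPNode:
        def __init__(self, vantage, median, left=None, right=None):
            self.vantage = vantage   # Sv: vantage point stored at this node
            self.median = median     # M: median distance to Sv of points below
            self.left = left         # subtree with d(x, Sv) <= M
            self.right = right       # subtree with d(x, Sv) >= M

    def range_search(node, query, r, distance, result):
        if node is None:
            return
        d = distance(query, node.vantage)
        if d <= r:                    # the vantage point itself may qualify
            result.append(node.vantage)
        if d - r <= node.median:      # (II) does not hold: left branch may contain answers
            range_search(node.left, query, r, distance, result)
        if d + r >= node.median:      # (I) does not hold: right branch may contain answers
            range_search(node.right, query, r, distance, result)

The mvp-tree search follows the same pattern, applying these inequalities once for each vantage point of a node and additionally filtering candidates with the distances pre-computed at construction time.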
