Fully Dynamic and Memory-Adaptative Spatial Approximation Trees

Diego Arroyuelo¹        Gonzalo Navarro²        Nora Reyes¹

¹ Depto. de Informática, Universidad Nacional de San Luis, Ejército de los Andes 950, San Luis, Argentina
² Center for Web Research, Dept. of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile

Abstract

Hybrid dynamic spatial approximation trees are recently proposed data structures for searching in metric spaces, based on combining the concepts of spatial approximation and pivot-based algorithms. These data structures are hybrid schemes, with the full features of dynamic spatial approximation trees, and capable of using the available memory to improve query time. It has been shown that they compare favorably against alternative data structures in spaces of medium difficulty.

In this paper we complete and improve hybrid dynamic spatial approximation trees, by presenting a new search alternative, an algorithm to remove objects from the tree, and an improved way of managing the available memory. The result is a fully dynamic and optimized data structure for similarity searching in metric spaces.

Key Words: databases, data structures, algorithms, metric spaces.

This work was partially supported by CYTED VII.19 RIBIDI Project (all authors) and Millennium Nucleus Center for Web Research, Grant P01-029-F, Mideplan, Chile (second author).

1 Introduction

"Proximity" or "similarity" searching is the problem of looking for objects in a set that are close enough to a query. This has applications in a vast number of fields. The problem can be formalized with the metric space model [2]: there is a universe $U$ of objects and a positive real-valued distance function $d : U \times U \to \mathbb{R}^+$ defined among them, which satisfies the metric properties: strict positiveness ($d(x,y) = 0 \Leftrightarrow x = y$), symmetry ($d(x,y) = d(y,x)$), and the triangle inequality ($d(x,z) \le d(x,y) + d(y,z)$). The smaller the distance between two objects, the more "similar" they are. We have a finite database $S \subseteq U$ that can be preprocessed to build an index. Later, given a query $q \in U$, we must retrieve all similar elements in the database. We are mainly interested in the range query: retrieve all elements in $S$ within distance $r$ of $q$, that is, $\{x \in S : d(x,q) \le r\}$.

Generally, the distance is expensive to compute, so one usually defines the search complexity as the number of distance evaluations performed. Proximity search algorithms build an index of the database to speed up queries, avoiding an exhaustive search. Many of these indexes are based on pivots (Section 2).

In this paper we complete and improve a hybrid index for metric space searching built on the dsa–tree [3], an index supporting insertions and deletions that is competitive in spaces of medium difficulty, but unable to take advantage of the available memory. It was enriched with a pivoting scheme in [1]. Pivots use the available memory to improve query time, and in this way they can beat any other structure, but too many pivots are needed in difficult spaces. Our new structure was still dynamic and made better use of memory, beating both dsa–trees and basic pivots. Now we present a new search alternative, a deletion algorithm, and a way of managing the available memory for hybrid dynamic spatial approximation trees. In this way we complete and improve our previous work [1].
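As a concrete baseline for the model above, the following minimal sketch answers a range query by exhaustive scan, spending one distance evaluation per database element; this is the cost every index in this paper tries to reduce. The Euclidean distance and all names here are illustrative choices, not part of the paper.

```python
import math

def euclidean(x, y):
    # One possible metric: Euclidean distance between two vectors.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def range_query_scan(S, q, r, d=euclidean):
    # Exhaustive range query: reports every x in S with d(x, q) <= r,
    # at the cost of |S| distance evaluations.
    return [x for x in S if d(x, q) <= r]

S = [(0.1, 0.2), (0.5, 0.9), (0.4, 0.3)]
print(range_query_scan(S, q=(0.4, 0.25), r=0.1))  # -> [(0.4, 0.3)]
```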
2 Pivoting Algorithms

Essentially, pivoting algorithms choose some elements $p_i$ from the database $S$, and precompute and store all distances $d(a, p_i)$ for all $a \in S$. At query time, they compute the distances $d(q, p_i)$ against the pivots. The distance by pivots between $a \in S$ and $q$ is then defined as $D(a,q) = \max_{p_i} |d(a,p_i) - d(q,p_i)|$. It can be seen that $D(a,q) \le d(a,q)$ for all $a \in S$, $q \in U$. This is used to avoid distance evaluations: each $a$ such that $D(a,q) > r$ can be discarded, because we deduce $d(a,q) > r$ without actually computing $d(a,q)$. All the elements that cannot be discarded this way are directly compared against $q$.

Pivoting schemes usually perform better as more pivots are used, and in this way they can beat any other index. They are, however, better suited to "easy" metric spaces [2]. In hard spaces they need too many pivots to beat other algorithms.

3 Dynamic Spatial Approximation Trees

In this section we briefly describe dynamic sa–trees (dsa–trees for short), in particular the version called timestamp with bounded arity [3], on top of which we build.

3.1 Insertion Algorithm

The dsa–tree is built incrementally, via insertions. The tree has a maximum arity. Each tree node $a$ stores a timestamp of its insertion time, $time(a)$, and its covering radius, $R(a)$, which is the maximum distance to any element in its subtree. Its set of children is called $N(a)$, the neighbors of $a$. To insert a new element $x$, its point of insertion is sought starting at the tree root and moving to the neighbor closest to $x$, updating $R(a)$ along the way. We finally insert $x$ as a new (leaf) child of $a$ if (1) $x$ is closer to $a$ than to any $b \in N(a)$, and (2) the arity of $a$, $|N(a)|$, is not already maximal. Neighbors are stored left to right in increasing timestamp order. Note that the parent is always older than its children.
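The insertion procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation; the Node class, the global clock, and the d parameter are hypothetical names introduced for the sketch.

```python
import itertools

_clock = itertools.count()  # global timestamp source

class Node:
    def __init__(self, obj):
        self.obj, self.time = obj, next(_clock)
        self.R, self.children = 0.0, []  # covering radius R(a), N(a)

def insert(root, x, d, max_arity):
    # Descend from the root towards the neighbor closest to x,
    # enlarging covering radii along the way; x becomes a child of a
    # when it is closer to a than to every b in N(a) and the arity
    # bound of a has not been reached.
    a = root
    while True:
        a.R = max(a.R, d(a.obj, x.obj))
        closest = min(a.children, key=lambda b: d(b.obj, x.obj), default=None)
        if closest is None or (d(a.obj, x.obj) < d(closest.obj, x.obj)
                               and len(a.children) < max_arity):
            a.children.append(x)  # children stay in timestamp order
            return
        a = closest
```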
3.2 Range Search Algorithm

The idea is to replicate the insertion process of the elements to retrieve. That is, we act as if we wanted to insert $q$, but keep in mind that relevant elements may be at distance up to $r$ from $q$, so in each decision for simulating the insertion of $q$ we permit a tolerance of $r$. It may thus be that relevant elements were inserted into different children of the current node, and backtracking is necessary.

Note that, at the time an element $x$ was inserted, a node $a$ may not have been chosen as its parent because its arity was already maximal. So, at query time, we must choose the minimum distance to $x$ only among $N(a)$. Note also that, when $x$ was inserted, elements with a higher timestamp were not yet present in the tree, so $x$ could choose its closest neighbor only among older elements. Hence, we consider the neighbors $\{b_1, \ldots, b_k\}$ of $a$ from oldest to newest, disregarding $a$, and perform the minimization as we traverse the list. That is, we enter the subtree of $b_i$ if $d(q,b_i) \le \min(d(q,b_1), \ldots, d(q,b_{i-1})) + 2r$.

We use timestamps to reduce the work inside older neighbors. Say that $d(q,b_i) > d(q,b_{i+j}) + 2r$. We have to enter the subtree of $b_i$ anyway because $b_i$ is older. However, only the elements with timestamp smaller than $time(b_{i+j})$ should be considered when searching inside $b_i$; younger elements have seen $b_{i+j}$, and they cannot be interesting for the search if they are inside $b_i$. As parent nodes are older than their descendants, as soon as we find a node inside the subtree of $b_i$ with timestamp larger than $time(b_{i+j})$ we can stop the search in that branch. Algorithm 1 performs range searching. Note that, except in the first invocation, $d(a,q)$ is already known from the invoking process.

RangeSearch(Node $a$, Query $q$, Radius $r$, Timestamp $t$)
1.  if $time(a) < t \land d(a,q) \le R(a) + r$ then
2.    if $d(a,q) \le r$ then report $a$
3.    $d_{min} \leftarrow \infty$
4.    for $b_i \in N(a)$ in increasing timestamp order do
5.      if $d(b_i,q) \le d_{min} + 2r$ then
6.        $k \leftarrow \min \{ j > i,\ d(b_i,q) > d(b_j,q) + 2r \}$
7.        RangeSearch($b_i$, $q$, $r$, $time(b_k)$)
8.      $d_{min} \leftarrow \min \{ d_{min},\ d(b_i,q) \}$

Algorithm 1: Range query algorithm on a dsa–tree with root $a$.

3.3 Deletion Algorithm

To delete an element $x$, the first step is to find it in the tree. In what follows we do not consider the location of the object as part of the deletion problem, although in [3] we have shown how to proceed if necessary. It should be clear that a tree leaf can always be removed without any complication, so we focus on how to remove internal tree nodes.

Deleting elements by rebuilding subtrees ensures that the resulting tree is exactly as if the deleted element had never been inserted. Thus, no degradation can occur due to repeated deletions. In this algorithm, when a node $x \in N(a)$ is deleted, we disconnect $x$ from the main tree, so all its descendants must be reinserted. Moreover, elements in the subtree of $a$ that are younger than $x$ were compared against $x$ to decide their insertion point. Therefore, these elements, in the absence of $x$, could choose another path if we reinsert them into the tree. Hence, we retrieve all the elements younger than $x$ that descend from $a$ (i.e., those whose timestamp is greater, which includes the descendants of $x$) and reinsert them into the tree, leaving the tree as if $x$ had never been inserted.

If we reinsert the elements younger than $x$ as completely new elements, that is, if they get fresh timestamps, we must search for the appropriate point of reinsertion beginning at the tree root. On the other hand, if we maintain their timestamps, we can start the reinsertion process from $a$ and save many comparisons. In order to leave the resulting tree exactly as if $x$ had never been inserted, we must reinsert the elements in the original order, that is, in increasing order of timestamp. Therefore, when a node $x \in N(a)$ is deleted, we retrieve all the elements younger than $x$ from the subtree rooted at $a$, disconnect them from the main tree, sort them in increasing order of timestamp, and reinsert them one by one, searching for their reinsertion point from $a$.

Note that with this method the covering radii can become overestimated, because they are never reduced upon deletions. If we delete an element $x$, every $a \in A(x)$ such that $x$ was the farthest element in its subtree will possibly have its $R(a)$ overestimated. In spite of this, the problem does not seem to affect search performance much, since it does not significantly degrade over time (see [4] for more details).
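The rebuilding strategy can be sketched as follows, building on the hypothetical Node/insert sketch above. The traversal and bookkeeping shown are an illustration of the idea under those assumptions, not the authors' code.

```python
def subtree(a):
    # All elements strictly below a, in no particular order.
    for b in a.children:
        yield b
        yield from subtree(b)

def delete(a, x, d, max_arity):
    # Rebuild-based deletion of x in N(a): collect every element of
    # a's subtree younger than x (this includes x's descendants),
    # unlink them together with x, and reinsert them in increasing
    # timestamp order, leaving the tree as if x had never existed.
    younger = sorted((e for e in subtree(a) if e.time > x.time),
                     key=lambda e: e.time)
    drop = {id(e) for e in younger} | {id(x)}
    for n in [a] + list(subtree(a)):
        n.children = [c for c in n.children if id(c) not in drop]
    for e in younger:
        e.children = []
        insert(a, e, d, max_arity)  # timestamps kept: restart at a
```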
4 Fully Dynamic Sa–trees with Pivots

Hybrid dynamic sa–trees were defined in [1], although without handling deletions. We review their main features and then present a deletion algorithm. Pivoting techniques can trade memory space for query time, but they perform well on easy spaces only. A dsa–tree, on the other hand, is suitable for searching spaces of medium difficulty, but it uses a fixed amount of memory and is unable to take advantage of additional memory to improve query time. The idea is to obtain a hybrid data structure that gets the best of both worlds, by enriching dsa–trees with pivots. The result is better than both building blocks.

We choose different pivots for each tree node, such that we do not need any extra distance evaluations against pivots, either at insertion or at search time. Recall that, after we find the insertion point of a new element $x$, say $x \in N(a)$, $x$ has been compared against all its ancestors in the tree, all the siblings of its ancestors, and its own siblings in $N(a)$. At query time, when we reach node $x$, some distances between $q$ and the aforementioned elements have also been computed. So, we can use (some of) these elements as pivots to obtain better search performance, without introducing extra distance computations. Next we present different ways to choose the pivots of each node.

4.1 H–Dsat1: Using Ancestors as Pivots

A natural alternative is to regard the ancestors of each node as its pivots. Let $A(x)$ be the set of ancestors of $x \in S$. We define $P(x) = \{(p_i, d(x,p_i)),\ p_i \in A(x)\}$. This set is computed during the insertion of $x$, by storing some of the distance evaluations performed in that process. We store $P(x)$ at each node $x$ and use it to prune the search.

4.1.1 Insertion Algorithm

To insert an element $x$, we set $P(x) = \emptyset$ and begin searching for the insertion point of $x$. For each node $a$ we choose in our path, we add $(a, d(x,a))$ to $P(x)$. When the insertion point of $x$ is found, $P(x)$ contains the distances to the ancestors of $x$. Note that we do not perform any extra distance evaluations to build $P(x)$. Thus, the construction cost of an H–Dsat1 is the same as that of a dsa–tree.

4.1.2 Range Search Algorithm

For range searching, we modify the dsa–tree algorithm to use the set $P(x)$ stored at each tree node $x$. Recall that, given a set of pivots, the distance by pivots $D(a,q)$ is a lower bound for $d(a,q)$. Consider again Algorithm 1. If at step 1 it holds that $D(a,q) > R(a) + r$, then surely $d(a,q) > R(a) + r$, and hence we can stop the search at node $a$ without actually evaluating $d(a,q)$. An element $a \in S$ is said to be feasible for query $q$ if $D(a,q) \le R(a) + r$; that is, it is feasible that $a$ or some element in its subtree lies within the search radius of $q$.

At search time, $D(a,q)$ can be computed without additional evaluations of $d$. Suppose that we reach node $p_k$ of the structure and want to decide whether the search must go into the subtree of $x \in N(p_k)$. At this point, we have computed all distances $d(q,p_i)$, $p_i \in A(x)$. If $A(x) = \{p_1, \ldots, p_k\}$, then these distances are $d(q,p_1), \ldots, d(q,p_k)$. As the set $P(x) = \{(p_1, d(x,p_1)), \ldots, (p_k, d(x,p_k))\}$ is stored at the node of $x$, the distances $d(x,p_i)$ and $d(q,p_i)$ needed to compute $D(x,q)$ are all present, at no extra cost. The distances $d(q,p_i)$ are stored in a stack as the search goes up and down the tree. The sets $P(x)$ are also stored in root-to-$x$ order, for example in a linear array, so that the references to the pivots in $P(x)$ (the first component of the pairs) are unnecessary to correctly compute $D$, and we save space.

The covering radius feasible neighbors of node $a$ (feasible neighbors for short), denoted $F(a)$, are the neighbors $b \in N(a)$ such that $D(b,q) \le R(b) + r$. The other neighbors are said to be infeasible. At search time, if we reach node $a$, only the feasible neighbors of $a$ need be taken into account, as the other subtrees can be discarded completely. Observe that these subtrees are discarded using $D$ and not $d$ and, as we have explained, $D$ is computed for free. However, it does not immediately follow that we obtain an improvement in search performance for sure. The reason is that infeasible nodes still serve to reduce $d_{min}$ in Algorithm 1, which in turn may save us entering younger siblings. Hence, by saving computations against infeasible nodes, we may have to enter new siblings later. This is an intrinsic price of our method.
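As an illustration of how $D$ comes for free, here is a minimal sketch of its computation from the array $P(x)$ and the stack of query-to-pivot distances. The names are hypothetical, and the None marker anticipates the invalid stack entries of Section 4.2.2.

```python
def lower_bound_D(P_x, q_dists):
    # P_x[i]    : d(x, p_i), stored at node x in root-to-x order.
    # q_dists[i]: d(q, p_i), pushed on a stack as the search descends;
    #             None marks a pivot that was never compared against q.
    return max((abs(dxp - dqp) for dxp, dqp in zip(P_x, q_dists)
                if dqp is not None), default=0.0)

# x can be declared infeasible, without ever evaluating d(x, q), when
# lower_bound_D(P_x, q_dists) > R_x + r.
```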
RangeSearchH–Dsat1(Node $a$, Query $q$, Radius $r$, Timestamp $t$)
1.  if $time(a) < t \land d(a,q) \le R(a) + r$ then
2.    if $d(a,q) \le r$ then report $a$
3.    $d_{min} \leftarrow \infty$
4.    $F(a) \leftarrow \{ b \in N(a),\ D(b,q) \le R(b) + r \}$
5.    for $b_i \in N(a)$ in increasing timestamp order do
6.      if $b_i \in F(a) \land D(b_i,q) \le d_{min} + 2r$ then
7.        if $d(b_i,q) \le d_{min} + 2r$ then
8.          $k \leftarrow \min \{ j > i,\ d(b_i,q) > d(b_j,q) + 2r \}$
9.          RangeSearchH–Dsat1($b_i$, $q$, $r$, $time(b_k)$)
10.     if $d(b_i,q)$ has already been computed then $d_{min} \leftarrow \min \{ d_{min},\ d(b_i,q) \}$

Algorithm 2: Range searching for query $q$ with radius $r$ in an H–Dsat1 with root $a$.

Now we present a new search alternative, not devised in [1]. The idea is to use $D$ along with the hyperplane criterion to save distance computations at search time. For any feasible element $b_i$ such that $D(b_i,q) > d_{min} + 2r$, it holds that $d(b_i,q) > d_{min} + 2r$. Hence, we can stop the search in the feasible node $b_i$ without evaluating $d(b_i,q)$. In what follows we present different alternatives of the search algorithm. Algorithm 2 shows the first alternative. However, in step 8 we run the risk of comparing infeasible elements against $q$. This is done in order to use timestamp information as much as possible, but it introduces the undesired effect of reducing the benefits of pivots in our data structure. We present some improvements to this weakness.

Optimizing using D. We use $D$ at search time not only to determine the feasibility of a node, pruning the search space and saving distance evaluations, but also to decrease the number of infeasible elements that are compared directly against $q$ in step 8 of the algorithm. We search inside $b_i$ using the timestamp $t$ of a younger sibling $b_k$ of $b_i$. Fortunately, some of the necessary comparisons can be saved by making use of $D$. The key observation is that $d(b_i,q) \le D(b_j,q) + 2r$ implies $d(b_i,q) \le d(b_j,q) + 2r$, so if $d(b_i,q) \le D(b_j,q) + 2r$ we can conclude that $b_j$ is not of interest in step 8, thus saving the computation of $d(b_j,q)$. Although we save some distance computations and obtain the same result, there will still be infeasible elements compared against $q$. We call this search method H–Dsat1D.

Using Timestamps of Feasible Neighbors. The use of timestamps is not essential for the correctness of the algorithms: any larger value would do, although the optimal choice is the smallest correct timestamp. Another alternative is thus to compute a safe approximation to the correct timestamp, while ensuring that no infeasible elements are ever compared against $q$. Note that every feasible neighbor of a node will inevitably be compared against $q$. If for $b_i \in F(a)$ it holds that $d(b_i,q) \le d_{min} + 2r$, then we compute the oldest timestamp $t$ among the reduced set $\{b_{i+j} \in F(a),\ d(b_i,q) > d(b_{i+j},q) + 2r\}$, and stop the search inside $b_i$ at nodes whose timestamp is newer than $t$. This ensures that only feasible elements are compared against $q$ and, under that condition, uses as much timestamping information as possible. We call this alternative H–Dsat1F.
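The H–Dsat1F timestamp rule admits a short sketch, under the same hypothetical naming as the previous sketches; dist_q is assumed to cache the $d(q,\cdot)$ values already evaluated for feasible neighbors.

```python
def hdsat1f_timestamp(i, neighbors, feasible, dist_q, r):
    # Oldest (smallest) timestamp among the younger feasible siblings
    # b_{i+j} with d(q, b_i) > d(q, b_{i+j}) + 2r; only feasible
    # neighbors are ever compared against q in this variant.
    b_i = neighbors[i]
    ts = [b.time for b in neighbors[i + 1:]
          if b in feasible and dist_q[b_i] > dist_q[b] + 2 * r]
    return min(ts, default=float('inf'))  # inf: no restriction applies
```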
4.1.3 Deletion Algorithm

We adapt the algorithm of rebuilding subtrees [4] to delete an element $x \in N(a)$ from the tree, so that it takes the existence of pivots into account. The reinsertion process involves distance evaluations, some of which are already precomputed as pivot information; we show how to take advantage of this. Note that $a$ is a pivot of every element in its subtree. In other words, the distance $d(b,a)$ has been stored in $P(b)$ for every $b$ in the subtree of $a$. As a result, we can save at least one distance evaluation for each element to be reinserted. Furthermore, if $y$ is an older sibling of $x$, and $y$ is a pivot (ancestor) of $b$, then we can save the distance evaluation $d(b,y)$ when reinserting $b$. As $P(b)$ is stored in a linear array, the position of $d(b,a)$ in $P(b)$ can be easily computed: if $a$ lies at level $i$ of the tree, then $d(b,a)$ is at position $i$ of the array.

It is important to note that, after reinserting $b$, the node $a$ and the ancestors of $a$ will again be ancestors of $b$. Hence, we must keep in $P(b)$ the distances between $b$ and the ancestors of $a$. The other distances in $P(b)$ are discarded before reinserting $b$. Finally, the new set $P(b)$ is completed using some of the distance evaluations produced at reinsertion time.

The deletion of $x \in N(a)$ can be summarized as follows. We retrieve all the elements younger than $x$ from the subtree rooted at $a$, disconnect them from the main tree, discard the distances to the pivots that are no longer of interest after reinsertion, sort the nodes in increasing order of timestamp, and reinsert them one by one (reusing some distances in $P$), searching for their reinsertion point from $a$. The result is very important: we do not introduce extra distance evaluations in the deletion process and, moreover, the deletion cost can be reduced by using pivots.

4.2 H–Dsat2: Using Ancestors and their Older Siblings as Pivots

We aim at using even more pivots than H–Dsat1, to further improve search performance. At search time, when we reach a node $a$, $q$ has been compared against all the ancestors and some of the older siblings of the ancestors of $a$. Hence, we use this extended set of pivots for each node $a$.

4.2.1 Insertion Algorithm

The only difference in an H–Dsat2 is in the sets $P(x)$ we compute. Let $x \in S$ and let $A(x) = \{p_1, \ldots, p_k\}$ be the set of its ancestors, where $p_i$ is the ancestor at tree level $i$. Note that $p_{i+1} \in N(p_i)$. Hence, $(b, d(x,b)) \in P(x)$ if and only if (1) $b \in A(x)$, or (2) $p_i, p_{i+1} \in A(x) \land b \in N(p_i) \land time(b) < time(p_{i+1})$.

4.2.2 Range Search Algorithm

For range searching, to compute $D(x,q)$ we need the distances between $q$ and the pivots of $x$, stored in a stack. It is possible, however, that some of the pivots of $x$ have not been compared against $q$ because they were infeasible. In order to retain the same pivot order as in $P(x)$, we push invalid entries onto the stack when infeasible neighbors are found; $D$ is then computed with this in mind. We define the same variants of the search algorithm for H–Dsat2, which only differ from those of H–Dsat1 in the way $D$ is computed.

4.2.3 Deletion Algorithm

Now suppose we want to delete an element $x \in N(a)$ from an H–Dsat2. Again, $a$ is a pivot of each element $b$ in the subtree rooted at $a$, so we can avoid at least one distance evaluation per element to be reinserted. But we can do more: the comparison $d(b,y)$ can be avoided for every $y \in N(a)$ older than $x$, since those distances have been stored in $P(b)$. Computing the positions of these distances in $P(b)$ is not as direct as before: they must be computed as we retrieve the nodes to be reinserted. Because of these features of H–Dsat2, the deletion cost can be reduced even further.
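The pivot set of Section 4.2.1 can be sketched directly from its definition. Here ancestors is the root-to-parent path of the new element, reusing the hypothetical Node sketch above; the function name is illustrative.

```python
def hdsat2_pivots(ancestors):
    # P(x) contains every ancestor p_i of x plus, for each consecutive
    # pair (p_i, p_{i+1}), the neighbors of p_i older than p_{i+1}:
    # exactly the elements x met while its insertion point was sought.
    pivots = []
    for i, p in enumerate(ancestors):
        pivots.append(p)
        if i + 1 < len(ancestors):
            nxt = ancestors[i + 1]
            pivots.extend(b for b in p.children if b.time < nxt.time)
    return pivots  # node x stores the pair (p, d(x, p)) for each p
```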
5 Limiting the Use of Storage

In practice, available memory is bounded. Our data structures, as defined so far, use memory in an uncontrolled way: each node uses as many pivots as the definition requires. This rules out our solutions for many real-life situations. We now show how to adapt the structures to fit the available memory. The idea is to restrict the number of pivots stored in each node to a value $k$, holding a subset of the original set of pivots. As a result, the data structures lose some performance at search time. A way of minimizing this degradation is to choose a "good" set of pivots for each node.

5.1 Choosing Good Pivots

We study empirically the features of the pivots that discard elements at search time. In our experiment, each time a pivot discards an element, we mark that pivot (for more details see Section 6). Because of the insertion process of H–Dsat1, the latest pivots of a node should be good, since they are close to the node, and hence good representatives of it. We verify experimentally that most discards using pivots were due to the latest ones. Figure 1 (left) shows that a small number of latest pivots per node suffices: in dimension 5, about 10 pivots per node discard all the elements that can be discarded using pivots, and in higher dimensions even fewer pivots are needed. We call this alternative H–Dsat1 k-Latest.

[Figure 1 (plots omitted): Percentage of elements discarded by different choices of pivots, as a function of the number of pivots. Left: latest pivots in H–Dsat1, n = 100,000 vectors, dim. 5, arity 4, radii retrieving 0.01%, 0.1%, and 1%. Right: latest, nearest, and random pivots in H–Dsat2, n = 100,000 vectors, dim. 15, arity 32.]

The ancestors of a node are close to it, but the siblings of the ancestors are not necessarily close. So we expect that using the $k$ latest pivots in H–Dsat2 (H–Dsat2 k-Latest) will not perform as well as before. An obvious alternative is H–Dsat2 k-Nearest, which uses the $k$ nearest pivots instead of the $k$ latest. Figure 1 (right) confirms that fewer nearest pivots than latest pivots are needed to discard the same number of nodes. Note, however, that for H–Dsat2 k-Nearest we need to store the references to the pivots in order to compute $D$. Hence, given a fixed amount of memory, this alternative must use fewer pivots per node than the others.

We have thus introduced a parameter $k$ in our data structures, which can be easily tuned since it depends on the available memory. When $k = 0$ the data structure becomes the original dsa–tree, and when $k = \infty$ it becomes our unrestricted data structures of Section 4.
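Both restricted pivot sets are one-line selections over the stored array $P(x)$. A minimal sketch follows (names are illustrative; the guard in k_latest avoids Python's P[-0:] pitfall, which would return the whole list).

```python
def k_latest(P, k):
    # Keep the entries of the k deepest ancestors: their positions in
    # the root-to-node array are implicit, so only distances are stored.
    return P[-k:] if k > 0 else []

def k_nearest(P, k):
    # Keep the k pivots closest to the node; their identities must be
    # stored too, since they no longer occupy predictable positions.
    return sorted(P, key=lambda pair: pair[1])[:k]
```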
5.2 Choosing Good Nodes

The dual question is whether some tree nodes profit more from pivots than others. We experimentally study the features of the elements that are discarded using pivots. The result is that, for all the metric spaces used, the discarded elements are located near the leaves of the tree. In the vector space of dimension 5, the percentage of discarded elements that are leaves varies from 40% to 60% (depending on the query radius), while in the space of dimension 15, almost 100% of the elements discarded by pivots are leaves. In the dictionary this percentage varies from 80% to 90%. The reason is that the covering radii of the nodes decrease as we go down the tree, becoming zero at the leaves. As the covering radius infeasibility condition for a node $a$ is $D(a,q) > R(a) + r$, the probability of discarding $a$ increases as $R(a)$ decreases.

Suppose that we restrict the number of pivots per node to a value $k$. As leaves are discarded more frequently than internal nodes, we consider an alternative that profits from this fact when using limited memory. The idea is to move the storage of pivots to the leaves smoothly and dynamically. We have a parameter $\rho$, with $0 \le \rho \le 1$, which determines the number of pivots per node so that: (1) internal nodes have $\rho \cdot k$ pivots (unless they do not have that many to choose from), and (2) external nodes have all the pivots that the scheme permits (unless there is not enough available space). This is implemented as follows: when an external node becomes internal, it retains $\rho \cdot k$ of its pivots and yields the others to a public repository; when a new external node appears, it takes from the repository all the pivots it needs (whenever the repository has that many; otherwise it takes all the available ones). In this way, each new element attempts to take a number of pivots as close as possible to its original number of pivots, and memory usage tends to move dynamically toward the leaves. The parameter $\rho$ controls the degree of movement of storage toward the leaves: when $\rho = 0$ all the pivots move to the leaves, and when $\rho = 1$ the memory management has no effect.

Figure 2 shows the experimental query cost for H–Dsat2 k-Nearest, in the vector space of dimension 15, using $k = 5$ and $k = 35$ pivots, and values 0, 0.1, 0.5, and 1 for $\rho$. In this metric space, we get the best performance with $\rho = 0$.

[Figure 2 (plots omitted): Performance of H–Dsat2 k-Nearest with memory management, for $\rho$ = 0.0, 0.1, 0.5, 1.0; percentage of database examined versus percentage of database retrieved, with $k = 5$ pivots (left) and $k = 35$ pivots (right); n = 100,000 vectors, dim. 15, arity 32.]
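The $\rho$ mechanism can be sketched as a shared pool of pivot slots. The bookkeeping below (class and method names, the ceiling rounding) is an assumption made for illustration, not the paper's exact accounting.

```python
import math

class PivotPool:
    # Shared pivot-slot repository; rho controls how much pivot
    # storage drifts from internal nodes toward the leaves.
    def __init__(self, rho, k):
        self.rho, self.k = rho, k
        self.free = 0  # slots released by nodes that became internal

    def leaf_became_internal(self, node):
        # An internal node keeps only rho*k pivots, donating the rest.
        keep = math.ceil(self.rho * self.k)
        self.free += max(0, len(node.pivots) - keep)
        node.pivots = node.pivots[-keep:] if keep > 0 else []

    def new_leaf_budget(self, wanted):
        # A fresh leaf grabs as many slots as the pool can spare.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted
```

With $\rho = 1$, leaf_became_internal keeps all $k$ pivots and donates nothing, so the pool never fills (no effect); with $\rho = 0$, every internal node donates everything and all pivot storage migrates to the leaves, matching the two extremes described above.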
6 Experimental Results

In this section we present a series of experiments performed on our data structures, evaluated on three metric spaces. The first is a dictionary of 69,069 English words under the edit distance (minimum number of character insertions, deletions, and substitutions to make two strings equal), of interest in spelling applications. The other spaces are real unitary cubes in dimensions 5 and 15 under the Euclidean distance, using 100,000 uniformly distributed random points; we treat these just as metric spaces, disregarding coordinate information. In all cases, we set apart 100 random elements to act as queries. The data structures were built 20 times, varying the order of insertions, and we tested arities 4, 8, 16, and 32. Each tree built was queried 100 times, using radii 1 to 4 in the dictionary, and radii retrieving 0.01%, 0.1%, and 1% of the set in the vector spaces.

In [1] we show that H–Dsat1F outperformed H–Dsat1D, clearly in the dictionary and slightly in the vector spaces; the results are similar for H–Dsat2. We also show experimentally that our structures are competitive, as our best versions of H–Dsat1 and H–Dsat2 largely improve upon dsa–trees. This shows that our structures make good use of extra memory. H–Dsat2 can use more memory than H–Dsat1, and hence its query cost is better. However, there is a price in memory usage: H–Dsat1 needs 1.3 to 4.0 times the memory of a dsa–tree, while H–Dsat2 requires 5.2 to 17.5 times. Hence our interest in comparing how well our structures use limited memory. Figures 3 and 4 compare them against a generic pivot data structure, using the same amount of memory in all cases. We also show a dsa–tree as a reference point, as it uses a fixed amount of memory. In easy spaces (dimension 5 or the dictionary) we do better when there is little available memory, while in dimension 15 H–Dsat2 is always the best; more pivots are needed to beat H–Dsat in harder problems.

[Figure 3 (plots omitted): Query cost of H–Dsat1F and H–Dsat2F versus a pivoting algorithm and a dsa–tree, in vector spaces of dimensions 5 (left) and 15 (right), for radii retrieving 0.01%, 0.1%, and 1% of the database; percentage of database examined versus number of pivots in H–Dsat2 latest.]

7 Conclusions

In this paper we have completed a hybrid scheme for similarity searching in metric spaces. The scheme is basically a dsa–tree, except that a set of pivots is associated with each node. The set of pivots of each node is chosen in such a way that no extra distance evaluations are introduced at insertion time: we only store some of the distance computations that occur when inserting a new element. At search time, when we reach a node, we use its pivots to prune the search space for free. We have also studied how to choose good pivots when the amount of memory is limited; several alternatives were explored and evaluated.
In this paper we have also presented a method to delete elements from a hybrid dynamic spatial approximation tree. This method has been shown to perform better than the original deletion method on a dsa–tree.

The outcome is a fully dynamic data structure that can be managed through insertions and deletions over arbitrarily long periods of time without any reorganization, and that can take advantage of the available memory to improve search and deletion costs.

[Figure 4 (plots omitted): Query cost of H–Dsat1F and H–Dsat2F versus a pivoting algorithm and a dsa–tree, in the dictionary of n = 69,069 words, for radii 2, 3, and 4; percentage of database examined versus number of pivots in H–Dsat2 latest.]

References

[1] D. Arroyuelo, F. Muñoz, G. Navarro, and N. Reyes. Memory-adaptative dynamic spatial approximation trees. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS. Springer, 2003. To appear.

[2] E. Chávez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín. Proximity searching in metric spaces. ACM Computing Surveys, 33(3):273–321, September 2001.

[3] G. Navarro and N. Reyes. Fully dynamic spatial approximation trees. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE 2002), LNCS 2476, pages 254–270. Springer, 2002.

[4] G. Navarro and N. Reyes. Improved deletions in dynamic spatial approximation trees. In Proc. of the XXIII International Conference of the Chilean Computer Science Society (SCCC'03). IEEE CS Press, 2003. To appear.
