Academia.eduAcademia.edu

Outline

Dynamic spatial approximation trees

https://doi.org/10.1145/1227161.1322337

Abstract
sparkles

AI

Dynamic spatial approximation trees provide a new methodology for efficiently reconstructing high-dimensional data structures. By optimizing the construction process and allowing for dynamic updates, these trees improve both memory utilization and access times for large vector datasets, making them suitable for applications in data mining and machine learning. The proposed structure significantly reduces computational overhead, enabling faster queries and updates, while also handling scalability for vast data sets.

Dynami Spatial Approximation Trees Gonzalo Navarro  Nora Reyes Dept. of Computer S ien e Depto. de Informati a University of Chile Universidad Na ional de San Luis Blan o En alada 2120, Santiago, Chile Ejer ito de los Andes 950, San Luis, Argentina gnavarrod .u hile. l nreyesunsl.edu.ar Abstra t quantization and ompression (where only some ve - tors an be represented and we ode the others as their The Spatial Approximation Tree (sa-tree) is a re- losest representable point); omputational biology (to ently proposed data stru ture for sear hing in metri nd a DNA or protein sequen e in a database allowing spa es. It has been shown that it ompares favorably some errors due to mutations); and fun tion predi tion against alternative data stru tures in spa es of high (to sear h for the most similar behavior of a fun tion in dimension or queries with low sele tivity. The main the past so as to predi t its probable future behavior). drawba k of the sa-tree is that it is a stati data stru - All those appli ations have some ommon hara - ture, that is, on e built, it is diÆ ult to add new el- teristi s. There is a universe U of obje ts, and a non- ements to it. This rules it out for many interesting negative distan e fun tion d : U  U ! R+ de ned appli ations. among them. This distan e satis es the three axioms In this paper we over ome this weakness. We pro- that make the set a metri spa e: stri t positiveness pose and study several methods to handle insertions (d(x; y ) = 0 , x = y ), symmetry (d(x; y ) = d(y; x)) in the sa-tree. Some are lassi al folklore solutions and triangle inequality (d(x; z )  d(x; y ) + d(y; z )). well known in the data stru tures ommunity, while the The smaller the distan e between two obje ts, the most promising ones have been spe i ally developed more \similar" they are. We have a nite database onsidering the parti ular properties of the sa-tree, and S  U , whi h is a subset of the universe of obje ts and involve new algorithmi insights in the behavior of this an be prepro essed (to build an index, for example). data stru ture. As a result, we show that it is viable to Later, given a new obje t from the universe (a query modify the sa-tree so as to permit fast insertions while q ), we must retrieve all similar elements found in the keeping its good sear h eÆ ien y. database. There are two typi al queries of this kind: Range query: Retrieve all elements within distan e r to q in S . This is, fx 2 S ; d(x; q )  rg. 1. Introdu tion Nearest neighbor query (k -NN): Retrieve the k losest elements to q in S . That is, a set A  S The on ept of \approximate" sear hing has appli- su h that jAj = k and 8x 2 A; y 2 S A; d(x; q )  ations in a vast number of elds. Some examples are d(y; q ). non-traditional databases (e.g. storing images, nger- prints or audio lips, where the on ept of exa t sear h The distan e is onsidered expensive to ompute is of no use and we sear h instead for similar obje ts); (think, for instan e, in omparing two ngerprints). text sear hing (to nd words and phrases in a text Hen e, it is ustomary to de ne the omplexity of database allowing a small number of typographi al or the sear h as the number of distan e evaluations per- spelling errors); information retrieval (to look for do - formed, disregarding other omponents su h as CPU uments that are similar to a given query or do ument); time for side omputations, and even I/O time. Given ma hine learning and lassi ation (to lassify a new a database of jS j = n obje ts, queries an be trivially element a ording to its losest representative); image answered by performing n distan e evaluations. The goal is to stru ture the database su h that we perform  Partially supported by Fonde yt grant 1-000929. less distan e evaluations. A parti ular ase of this problem arises when the it is diÆ ult to add new elements to it. This makes spa e is a set of d-dimensional points and the dis- the sa-tree unsuitable for dynami appli ations su h as P tan e belongs to the Minkowski Lp family: Lp = ( 1id jxi yi jp )1=p . The best known spe ial ases multimedia databases. Over oming this weakness is the aim of this paper. are p = 1 (Manhattan distan e), p = 2 (Eu lidean We propose and study several methods to handle inser- distan e) and p = 1 (maximum distan e), that is, tions in the sa-tree. Some are lassi al folklore solutions L1 = max1id jxi y i j. well known in the data stru tures ommunity, while the There are e e tive methods to sear h on d- most promising ones have been spe i ally developed dimensional spa es, su h as kd-trees [2℄ or R-trees [13℄. onsidering the parti ular properties of the sa-tree. As However, for roughly 20 dimensions or more those a result, we show that it is viable to modify the sa-tree stru tures ease to work well. We fo us in this paper in so as to permit fast insertions while keeping its good general metri spa es, although the solutions are well sear h eÆ ien y. As a related byprodu t of this study, suited also for d-dimensional spa es. we give new algorithmi insights in the behavior of this It is interesting to noti e that the on ept of \di- data stru ture. mensionality" an be translated to metri spa es as well: the typi al feature in high dimensional spa es 2. Previous Work with Lp distan es is that the probability distribution of distan es among elements has a very on entrated histogram (with larger mean as the dimension grows), Algorithms to sear h in general metri spa es an be making the work of any similarity sear h algorithm divided in two large areas: pivot-based and lustering more diÆ ult [5, 10℄. In the extreme ase we have a algorithms. (See [10℄ for a more omplete review.) spa e where d(x; x) = 0 and 8y 6= x; d(x; y ) = 1, where it is impossible to avoid a single distan e evaluation at sear h time. We say that a general metri spa e is high Pivot-based algorithms. The idea is to use a set dimensional when its histogram of distan es is on en- of k distinguished elements (\pivots") p1 :::pk 2 S trated. and storing, for ea h database element x, its dis- There are a number of methods to prepro ess the set tan e to the k pivots (d(x; p1 ):::d(x; pk )). Given the in order to redu e the number of distan e evaluations. query q , its distan e to the k pivots is omputed All those stru tures work on the basis of dis arding (d(q; p1 ):::d(q; pk )). Now, if for some pivot pi it holds elements using the triangle inequality, and most use that jd(q; pi ) d(x; pi )j > r, then we know by the tri- the lassi al divide-and- onquer approa h (whi h is not angle inequality that d(q; x) > r and therefore do not spe i of metri spa e sear hing). need to expli itly evaluate d(x; p). All the other el- The Spatial Approximation Tree (sa-tree) is a re- ements that annot be eliminated using this rule are ently proposed data stru ture of this kind [16℄, whi h dire tly ompared against the query. is based on a novel on ept: rather than dividing the Several algorithms [23, 15, 7, 18, 6, 8℄ are almost sear h spa e, approa h the query spatially, that is, dire t implementations of this idea, and di er basi ally start at some point in the spa e and get loser and in their extra stru ture used to redu e the CPU ost of loser to the query. It has been shown that the sa- nding the andidate points, but not in their number tree behaves better than the other existing stru tures of distan e evaluations. on metri spa es of high dimension or queries with low There are a number of tree-like data stru tures that sele tivity, whi h is the ase in many appli ations. use this idea in a more indire t way: they sele t a pivot The sa-tree, unlike other data stru tures, does not as the root of the tree and divide the spa e a ording have parameters to be tuned by the user of ea h ap- to the distan es to the root. One sli e orresponds to pli ation. This makes it very appealing as a general ea h subtree (the number and width of the sli es di ers purpose data stru ture for metri sear hing, sin e any a ross the strategies). At ea h subtree, a new pivot is non-expert seeking for a tool to solve his/her parti ular sele ted and so on. The sear h ba ktra ks on the tree problem an use it as a bla k box tool, without the need using the triangle inequality to prune subtrees, that is, of understanding the ompli ations of an area he/she if a is the tree root and b is a hildren orresponding is not interested in. Other data stru tures have many to d(a; b) 2 [x1 ; x2 ℄, then we an avoid entering in the tuning parameters, hen e requiring a big e ort from subtree of b whenever [d(q; a) r; d(q; a) + r℄ has no the user in order to obtain an a eptable performan e. interse tion with [x1 ; x2 ℄. On the other hand, the main weakness of the sa- Several data stru tures use this idea [3, 22, 14, 24, tree is that it is not dynami . That is, on e it is built, 4, 25℄. Clustering algorithms. The se ond trend onsists onstru tion time O(n log2 n= log log n) and sublinear in dividing the spa e in zones as ompa t as possible, sear h time O(n1 (1= log log n) ) in high dimensions and normally re ursively, and storing a representative point O(n ) (0 < < 1) in low dimensions. It is experi- (\ enter") for ea h zone plus a few extra data that mentally shown to improve over other data stru tures permits qui kly dis arding the zone at query time. Two when the dimension is high or the query radius is large. riteria an be used to delimit a zone. For more details see the original referen es [16, 17℄. The rst one is the Voronoi area, where we sele t a set of enters and put ea h other point inside the zone 3.1. Constru tion of its losest enter. The areas are limited by hyper- planes and the zones are analogous to Voronoi regions in ve tor spa es. Let f 1 : : : m g be the set of en- We sele t a random element a 2 S to be the root ters. At query time we evaluate (d(q; 1 ); : : : ; d(q; m )), of the tree. We then sele t a suitable set of neighbors hoose the losest enter and dis ard every zone N (a) satisfying the following property: whose enter i satis es d(q; i ) > d(q; ) + 2r, as its Voronoi area annot interse t with the query ball. Condition 1: (given a; S ) 8x 2 S , x 2 N (a) , The se ond riterion is the overing radius r( i ), 8y 2 N (a) fxg; d(x; y) > d(x; a). whi h is the maximum distan e between i and an el- ement in its zone. If d(q; i ) r > r( i ), then there is That is, the neighbors of a form a set su h that any no need to onsider zone i. neighbor is loser to a than to any other neighbor. The The te hniques an be ombined. Some te hniques \only if" (() part of the de nition guarantees that if using only hyperplanes are [22, 19, 12℄. Some te h- we an get loser to any b 2 S then an element in N (a) niques using only overing radii are [11, 9℄. One using is loser to b than a, be ause we put as dire t neigh- both riteria [5℄. bors all those elements that are not loser to another neighbor. The \if" part ()) aims at putting as few neighbors as possible. Nearest neighbor queries. To answer 1-NN Noti e that the set N (a) is de ned in terms of itself queries, we simulate a range query with a radius that in a non-trivial way and that multiple solutions t the is initially r = 1, and redu e r as we nd loser and de nition. For example, if a is far from b and and loser elements to q . At the end, we have in r the these are lose to ea h other, then both N (a) = fbg distan e to the losest elements and have seen them and N (a) = f g satisfy the de nition. all. Unlike a range query, we are now interested in Finding the smallest possible set N (a) seems to be qui kly nding lose elements in order to redu e r as a nontrivial ombinatorial optimization problem, sin e early as possible, so there are a number of heuristi s to by in luding an element we need to take out others a hieve this. One of the most interesting is proposed in (this happens between b and in the example of the [21℄, where the subtrees yet to be pro essed are stored previous paragraph). However, simple heuristi s whi h in a priority queue in a heuristi ally promising order- add more neighbors than ne essary work well. We be- ing. The traversal is more general than a ba ktra king. gin with the initial node a and its \bag" holding all the Ea h time we pro ess the root of the most promising rest of S . We rst sort the bag by distan e to a. subtree, we may add its hildren to the priority queue. Then, we start adding nodes to N (a) (whi h is ini- At some point we an preempt the sear h using a uto tially empty). Ea h time we onsider a new node b, we riterion given by the triangle inequality. he k whether it is loser to some element of N (a) than k-NN queries are handled as a generalization of 1- to a itself. If that is not the ase, we add b to N (a). NN queries. Instead of one losest element, the k los- At this point we have a suitable set of neighbors. est elements known are maintained, and r is the dis- Note that Condition 1 is satis ed thanks to the fa t tan e to the farthest to q among those k . Ea h time that we have onsidered the elements in order of in- a new andidate appears we insert it into the queue, reasing distan e to a. The \only if" part of Condition whi h may displa e another element and hen e redu e 1 is learly satis ed be ause any element not satisfying r . At the end, the queue ontains the k losest ele- it is inserted in N (a). The \if" part is more deli ate. ments to q . Let x 6= y 2 N (a). If y is loser to a than x then y was onsidered rst. Our onstru tion algorithm guaran- 3. The Spatial Approximation Tree tees that if we inserted x in N (a) then d(x; a) < d(x; y ). If, on the other hand, x is loser to a than y , then We des ribe brie y in this se tion the sa-tree data d(y; x) > d(y; a)  d(x; a) (that is, a neighbor annot stru ture. It needs linear spa e O(n), reasonable be removed by a new neighbor inserted later). We now must de ide in whi h neighbor's bag we The key observation is that, even if q 62 S , the an- put the rest of the nodes. We put ea h node not in swers to the query are elements q 2 S . So we use the 0 fag [ N (a) in the bag of its losest element of N (a) tree to pretend that we are sear hing for an element (best- t strategy). Observe that this requires a se ond q 2 S . We do not know q , but sin e d(q; q )  r , we 0 0 0 pass on e N (a) is fully determined. an obtain from q some distan e information regard- We are done now with a, and pro ess re ursively all ing q : by the triangle inequality it holds that for any 0 its neighbors, ea h one with the elements of its bag. x 2 U , d(x; q ) r  d(x; q )  d(x; q ) + r . 0 Note that the resulting stru ture is a tree that an be Hen e, instead of simply going to the losest neigh- sear hed for any q 2 S by spatial approximation for bor, we rst determine the losest neighbor of q nearest neighbor queries. The reason why this works is among fag [ N (a). We then enter into all neighbors that, at sear h time, we repeat exa tly what happened b 2 N (a) su h that d(q; b)  d(q; ) + 2r . This is be- with q during the onstru tion pro ess (i.e. we enter ause the virtual element q sought an di er from q by 0 into the subtree of the neighbor losest to q ), until we at most r at any distan e evaluation, so it ould have rea h q . This is is be ause q is present in the tree, i.e., been inserted inside any of those b nodes. In the pro- we are doing an exa t sear h after all. ess, we report all the nodes q we found lose enough 0 Finally, we save some omparisons at sear h time by to q . storing at ea h node a its overing radius, i.e. the max- Moreover, noti e that, in an exa t sear h for a q 2 0 imum distan e R(a) between a and any element in the S , the distan es between q and the nodes we traverse 0 subtree rooted by a. The way to use this information get redu ed as we step down the tree. That is, is made lear in Se tion 3.2. Figure 1 depi ts the onstru tion pro ess. It is rstly invoked as BuildTree(a,S fag) where a is Observation 1: Let a; b; 2 S su h that b des ends a random element of S . Note that, ex ept for the rst from a and from b in the tree. Then d( ; b)  d( ; a). level of the re ursion, we already know all the distan es d(v; a) for every v 2 S and hen e do not need to re- ompute them. Similarly, d(v; ) at line 10 is already The same happens, allowing a toleran e of 2r, in a known from line 6. The information stored by the data range sear h with radius r. That is, for any b in the stru ture is the root a and the N () and R() values of path from a to q it holds d(q ; b)  d(q ; a), so d(q; b)  0 0 0 d(q; a) + 2r . Hen e, while at rst we need to enter into all the nodes. all the neighbors b 2 N (a) su h that d(q; b) d(q; )  BuildTree (Node a , Set of nodes S ) 2r, when we enter into those b the toleran e is not 2r anymore but it gets redu ed to 2r (d(q; b) d(q; )). N (a) ; /* neighbors of a */ The overing radius R(a) is used to further prune the R(a) /* overing radius */ 0 sear h, by not entering into subtrees su h that d(q; a) > Sort S by distan e to a ( loser first) R(a) + r , sin e they annot ontain useful elements. for v 2 S do Figure 2 illustrates the sear h pro ess, starting from R(a) max(R(a); d(v; a)) the tree root p11 . Only p9 is in the result, but all the if 8 2b N (a); d(v; a) < d(v; b) bold edges are traversed. Figure 3 gives the sear h al- then N (a) N (a )[f g v gorithm, initially invoked as RangeSear h(a,q ,r,2r), for b 2 N (a ) do S (b ) ; for v 2 S N (a) do where a is the tree root. Note that in the re ursive Let 2 N (a) be the one minimizing d(v; ) invo ations d(a; q ) is already omputed. S ( ) S ( ) [f gv for b 2 N (a ) do BuildTree (b, S (b)) Nearest neighbor sear hing. We an also perform Figure 1. Algorithm to build the sa-tree. nearest neighbor sear hing by simulating a range sear h where the sear h radius is redu ed, just as explained at the end of Se tion 2. We have a priority queue of sub- trees, most promising rst. Initially, we insert the sa- 3.2. Sear hing tree root in the data stru ture. Iteratively, we extra t the most promising subtree, pro ess its root, and insert Of ourse it is of little interest to sear h only for ele- all its subtrees in the queue. This is repeated until the ments q 2 S . The tree we have des ribed an, however, queue gets empty or its most promising subtree an be be used as a devi e to solve queries of any type for any dis arded (i.e., its promise value is bad enough). For q 2 U . We start with range queries with radius r . la k of spa e we omit further details. p3 p12 as queries. The results on these two spa es are rep- resentative of those on many other metri spa es we p2 tested: NASA images, di tionaries in other languages, p4 Gaussian distributions, other dimensions, et . As a omparison point for whi h follows, a stati p11 p6 p10 onstru tion osts about 5 million omparisons for the p7 p9 di tionary and 12.5 million for the ve tor spa e. p14 p15 4.1. Rebuilding the Subtree q p1 p5 p13 The naive approa h rebuilds the whole subtree p8 rooted at a on e a new element x being inserted has to be ome a new neighbor of a. This has the advantage of preserving the same tree that is built stati ally, but, Figure 2. An example of the sear h pro ess. as Figure 4 shows for the ase of the di tionary, the dy- nami onstru tion be omes too ostly in omparison RangeSear h (Node a, Query q , Radius r, to a stati one (140 times more ostly in this example, Toleran e t) almost 230 times more in our ve tor spa e). if d(a; q )  r then Report a if d(a; q )  R(a) + r then Construction cost for n = 69,069 words dmin min fd( ; q ); 2f g[ a N (a)g 8000 Static 2 Distance evaluations (x 10,000) for b N (a) do 7000 Dynamic if d(b; q )dmin  t then 6000 RangeSear h (b,q ,r,t (d(b; q ) ) dmin ) 5000 Figure 3. Sear hing q with radius r in a sa-tree. 4000 3000 2000 4. In remental Constru tion 1000 0 The sa-tree is a stru ture whose onstru tion algo- 10 20 30 40 50 60 70 80 90 100 rithm needs to know all the elements of S in advan e. Percentage of database used In parti ular, it is diÆ ult to add new elements un- der the best- t strategy on e the tree is already built. Figure 4. Constru tion ost by rebuilding subtrees. Ea h time a new element is inserted, we must go down the tree by the losest neighbor until the new element must be ome a neighbor of the urrent node a. All 4.2. Over ow Bu kets the subtree rooted at a must be rebuilt from s rat h, sin e some nodes that went into another neighbor ould We an have an over ow bu ket per node with \ex- prefer now to get into the new neighbor. tra" neighbors that should go in the subtree but have In this se tion we dis uss and empiri ally evaluate not been lassi ed yet. When the new element x must di erent alternatives to permit insertion of new ele- be ome a neighbor of a, we put it in the over ow bu ket ments into an already built sa-tree. For the experi- of a. Ea h time we rea h a at query time, we also ments we have sele ted two metri spa es. The rst is ompare q against its over ow bu ket and report any a di tionary of 69,069 English words, from where we element near enough. randomly hose queries. The distan e in this ase is We must limit the size of the over ow bu kets in the edit distan e, that is, minimum number of har- order to maintain a reasonable sear h eÆ ien y. We a ter insertions, deletions and repla ements to make rebuild a subtree when its over ow bu ket ex eeds a the strings equal. The se ond spa e is the real unitary given size. The main question is whi h is the tradeo ube in dimension 15 using Eu lidean distan e. We in pra ti e between re onstru tion ost and query ost. generated 100,000 random points with uniform distri- As smaller over ow bu kets are permitted, we rebuild bution. For the queries, we build the indexes with 90% the tree more often and improve the query time, but of the points and use the other 10% (randomly hosen) the onstru tion time raises. Construction cost for n = 69,069 Query cost per element for n = 69,069 words 800 60000 Distance evaluations (x 10,000) 700 50000 Distance evaluations 600 40000 500 30000 400 20000 300 10000 Bucket size = 250 200 Bucket size = 500 Bucket size = 1000 100 0 0 100 200 300 400 500 600 700 800 900 1000 1 2 3 4 Size of overflow bucket Search radius Construction cost for n = 100,000 vectors dim. 15 Query cost per element for n = 100,000 vectors dim. 15 1800 90000 Distance evaluations (x 10,000) 1600 85000 Distance evaluations 1400 80000 1200 75000 1000 70000 800 65000 Bucket size = 250 600 60000 Bucket size = 500 Bucket size = 1000 400 55000 0 100 200 300 400 500 600 700 800 900 1000 0.01 0.1 1 Size of overflow bucket Percentage of database retrieved Figure 5. Constru tion osts using over ow bu kets. Figure 6. Sear h osts using over ow bu kets. stati onstru tion provided the orre t bu ket size is Figure 5 shows the ost of the onstru tion using hosen. For example, with bu ket size 500 we obtain di erent bu ket sizes, whi h exhibits interesting u - almost the same sear h osts as for the stati version, at tuations and in some ases osts even less than a stati the modest pri e of 10% extra onstru tion ost for the onstru tion. This is possible be ause many un lassi- di tionary and 30% for the ve tors. The main problem ed elements are left in the bu kets. For example, for in this method is its high sensitivity to the u tuations, bu ket size 1,000, almost all the elements are in over- whi h makes it diÆ ult to sele t a good bu ket size. ow bu kets in the di tionary ase and almost 60% The intermediate bu ket size 500 works well be ause in the ve tor ase. These u tuations appear be ause at this point the elements in over ow bu kets are 30% a larger bu ket size may produ e more rebuilds than in the di tionary and 15% in the ve tors. a smaller one for a given set size n. The e e t is well known, for example it appears when studying the num- 4.3. A First-Fit Strategy ber of splits as a fun tion of the B-tree page size [1℄. Figure 6 shows the sear h osts using over ow bu k- Yet another solution is to hange our best- t strategy ets. We sear hed with xed radius 1 to 4 in the di tio- to put elements inside the bags of the neighbors of a nary example and with radii retrieving 0.01%, 0.1% at onstru tion time. An alternative, rst- t, is to put and 1% of the set in the ve tor example. We also ea h node in the bag of the rst neighbor loser than performed nearest neighbor sear h experiments, whi h a to q. Determining N (a) and the bag of ea h other yielded similar results and are omitted for la k of spa e. element an now be done all in one pass. As an be seen by omparing the results to those With the rst- t strategy, however, we an easily of Figure 8, this te hnique is ompetitive against the add more elements by pretending that the new in om- Construction cost for n = 69,069 words Query cost per element for n = 69,069 words 1600 50000 Static Distance evaluations (x 10,000) 1400 First-Fit 45000 Timestamp (up) 40000 Distance evaluations 1200 Timestamp (down) 35000 1000 30000 800 25000 600 20000 15000 400 Best-Fit 10000 First-Fit 200 5000 Timestamp (up) Timestamp (down) 0 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 Percentage of database used Search radius Construction cost for n = 100,000 vectors dim. 15 Query cost per element for n = 100,000 vectors dim. 15 1400 90000 Static Distance evaluations (x 10,000) 1200 First-Fit 85000 Timestamp (down) Distance evaluations Timestamp (up) 80000 1000 75000 800 70000 600 65000 400 60000 Best-Fit 200 55000 First-Fit Timestamp 0 50000 10 20 30 40 50 60 70 80 90 100 0.01 0.1 1 Percentage of database used Percentage of database retrieved Figure 7. Constru tion osts using rst- t and using Figure 8. Sear h osts using rst- t and the two timestamps. versions of the timestamping te hnique. ing element x was the last one in the bag, whi h means Figure 8 shows sear h times. As an be seen, the that when it be omes a neighbor of a it an be simply sear h overhead of the rst- t strategy is too high, at a added as the last neighbor of a, and there were no later point that makes the stru ture not ompetitive against elements that had the han e of getting into x. This other existing ones. allows building the stru ture by su essive insertions. Figure 7 shows that the onstru tion (stati or dy- 4.4. Timestamping nami ) using rst- t is mu h heaper than using best- t. Moreover, rst- t osts exa tly the same and pro- An alternative that has resemblan es with the two du es the same tree in the stati or the dynami ase. previous but is more sophisti ated onsists in keeping a Range sear hing under the rst- t strategy is a lit- timestamp of the insertion time of ea h element. When tle di erent. We onsider the neighbors fv1 ; : : : ; vk g inserting a new element, we add it as a neighbor at the of a in order. We perform the minimization while appropriate point using best- t and do not rebuild the we traverse the neighbors. That is, we enter into tree. Let us onsider that neighbors are added at the the subtree of v1 if d(q; v1 )  d(q; a) + 2r; into the end, so by reading them left to right we have in reasing subtree of v2 if d(q; v2 )  min(d(q; a); d(q; v1 )) + 2r; insertion times. It also holds that the parent is always and in general into the subtree of vi if d(q; vi )  older than its hildren. min(d(q; a); d(q; v1 ); : : : ; d(q; vi 1 )) + 2r. This is be- As seen in Figure 7, this alternative an ost a bit ause vi+j an never take out an element from vi . more or a bit less than stati best- t depending on the ase. Two versions of this methods, labeled \up" and namism. In the ase of the di tionary, the timestamp- \down" in the plot, orrespond to how to handle the ing te hnique is signi antly worse than the stati one ase of equal distan es to the root and to the losest (although the \up" behaves slightly better for nearest neighbor when inserting a new element. The former neighbor sear hing). The problem is that the \up" ver- inserts the element as a new neighbor and the latter sion is mu h more ostly to build, needing more than sends it to the subtree of the losest neighbor. This 3 times the stati onstru tion ost. makes a di eren e only in dis rete distan es. At sear h time, we onsider the neighbors fv1 ; : : : ; vk g of a from oldest to newest. We perform 4.5. Inserting at the Fringe the minimization while we traverse the neighbors, ex- a tly as in Se tion 4.3. This is be ause between the insertion of vi and vi+j there may have appeared new Yet another alternative is as follows. We an relax elements that preferred vi just be ause vi+j was not Condition 1 (Se tion 3.1), whose main goal is to guar- yet a neighbor, so we may miss an element if we do not antee that if q is loser to a than to any neighbor in enter into vi be ause of the existen e of vi+j . N (a) then we an stop the sear h at that point. The Note that, although the sear h pro ess is the same as idea is that, at sear h time, instead of nding the los- under rst- t, the insertion puts the elements into their est among fag[ N (a) and entering into any b 2 N (a) losest neighbor, so the stru ture is more balan ed. su h that d(q; b)  d(q; ) + 2r, we ex lude the sub- Up to now we do not really need timestamps but tree root fag from the minimization. Hen e, we always just to keep the neighbors sorted. Yet a more so- ontinue to the leaves by the losest neighbor and oth- phisti ated s heme is to use the timestamps to re- ers lose enough. This seems to make the sear h time du e the work done inside older neighbors. Say that slightly worse, but the ost is marginal. d(q; vi ) > d(q; vi+j ) + 2r . We have to enter into vi The bene t is that we are not for ed anymore to put be ause it is older. However, only the elements with a new inserted element x as a neighbor of a, even when timestamp smaller than that of vi+j should be onsid- Condition 1 would require it. That is, at insertion time, ered when sear hing inside vi ; younger elements have even if x is loser to a than to any element in N (a), we seen vi+j and they annot be interesting for the sear h have the hoi e of not putting it as a neighbor of a but if they are inside vi . As parent nodes are older than inserting it into its losest neighbor of N (a). At sear h their des endants, as soon as we nd a node inside the time we will rea h x be ause the sear h and insertion subtree of vi with timestamp larger than that of vi+j pro esses are similar. we an stop the sear h in that bran h, be ause its sub- This freedom opens a number of new possibilities tree is even younger. that deserve a mu h deeper study, but an immediate An alternative view, equivalent as before but fo us- onsequen e is that we an insert always at the leaves ing on maximum allowed radius instead of maximum of the tree. Hen e, the tree is read-only in its top part allowed timestamp, is as follows. Ea h time we enter and it hanges only in the fringe. into a subtree y of vi , we sear h for the siblings vi+j of vi that are older than y . Over this set, we ompute However, we have to permit the re onstru tion of the maximum radius that permits to avoid pro essing small subtrees so as to avoid that the tree be omes y , namely ry = max(d(q; vi ) d(q; vi+j ))=2. If it holds almost a linked list. So we permit inserting x as a r < ry , we do not need to enter into the subtree y . neighbor when the size of the subtree to rebuild is small Let us now onsider nearest neighbor sear hing. As- enough, whi h leads to a tradeo between insertion ost sume that we are urrently pro essing node vi and in- and quality of the tree at sear h time. sert its hildren y in the priority queue. We ompute Figure 9 shows the onstru tion ost for di erent ry as before and insert it together with y in the priority maximum tree sizes that an be rebuilt. As an be seen, queue. Later, when the time to pro ess y omes, we permitting a tree size of 50 yields the same onstru tion onsider our urrent sear h radius r and dis ard y if ost of the stati version.  < r . If we insert a hildren z of y , we put it the r y Finally, Figure 10 shows the sear h times using this value min(ry ; rz ). te hnique. As an be seen, using a tree size of 50 per- Figure 8 ompares this te hnique against the stati mits the same and even better sear h time ompared one. As it an be seen, this is an ex ellent alterna- to the stati version, whi h shows that it may be benef- tive to the stati onstru tion in the ase of our ve - i al to move elements downward in the tree. This fa t tor spa e example, providing basi ally the same on- makes this alternative a very interesting hoi e deserv- stru tion and sear h ost with the added value of dy- ing more study. Construction cost for n = 69,069 words Query cost per element for n = 69,069 words 2200 45000 Size = 10 Distance evaluations (x 10,000) 2000 40000 Size = 50 1800 Size = 100 Distance evaluations 35000 Size = 500 1600 30000 Size = 1000 1400 Best-Fit 25000 1200 20000 1000 15000 800 600 10000 400 5000 200 0 0 100 200 300 400 500 600 700 800 900 1000 1 2 3 4 Tree size allowed to reconstruct Search radius Construction cost for n = 100,000 vectors dim. 15 Query cost per element for n = 100,000 vectors dim. 15 7000 85000 Distance evaluations (x 10,000) 6000 80000 Distance evaluations 75000 5000 70000 4000 65000 3000 60000 Size = 10 55000 Size = 50 2000 50000 Size = 100 Size = 500 1000 45000 Size = 1000 Best-Fit 0 40000 0 100 200 300 400 500 600 700 800 900 1000 0.01 0.1 1 Tree size allowed to recontruct Percentage of database retrieved Figure 9. Constru tion osts inserting at the fringe. Figure 10. Sear h osts using insertion in the fringe. 5. Con lusions Other alternatives, su h as rebuilding and rst- t, proved to be not ompetitive, although the latter o ers We have presented several te hniques to modify the very low onstru tion osts, whi h ould be interesting sa-tree in order to make it a dynami data stru ture despite its mu h higher sear h ost. supporting insertions, without degrading its urrent It is lear now that making the sa-tree dynami is af- performan e. We have shown that there are many fordable and that the stru ture an even be improved in more alternatives than what appears at a rst glan e, a dynami setup, ontrary to our previous assumption and that the invariants of the sa-tree an be relaxed in that there would be a ost for the dynamism. On the ways unforeseen before this study (e.g. the fa t that other hand, we need to pursue more in the most promis- we an de ide whether or not to add neighbors). ing alternatives in order to understand them better. From the hoi es we have onsidered, the use of over- Moreover, we have not onsidered deletions yet. These ow bu kets shows that it is possible to obtain on- seem more diÆ ult but always an be treated by mark- stru tion and sear h times similar to those of the stati ing the nodes as deleted and making periodi rebuilds. version, although the hoi e of the bu ket size deserves This work is a rst step of a broader proje t [20℄ more study. Timestamping has also shown ompetitive whi h aims at a fully dynami data stru ture for sear h- in some metri spa es and not so attra tive in others, ing in metri spa es, whi h an also work on se ondary a fa t deserving more study. Finally, inserting at the memory. We have not tou hed this last aspe t in this fringe has shown the potential of even improving the paper. A simple solution to store the sa-tree in se - performan e of the stati version, although studying ondary storage is to try to store whole subtrees in disk the e e t of the size of the fringe is required. pages so as to minimize the number of pages read at sear h time. This has an interesting relationship with [14℄ L. Mi o, J. On ina, and R. Carras o. A fast bran h inserting at the fringe (Se tion 4.5), not only be ause and bound nearest neighbor lassi er in metri spa es. the top part of the tree is read-only, but also be ause Pattern Re ognition Letters, 17:731{739, 1996. Else- we an ontrol the maximum arity of the tree so as to vier. make the neighbors t in a disk page. [15℄ L. Mi o, J. On ina, and E. Vidal. A new version of the nearest-neighbor approximating and eliminating sear h (aesa) with linear prepro essing-time and mem- Referen es ory requirements. Pattern Re ognition Letters, 15:9{ 17, 1994. Elsevier. [16℄ G. Navarro. Sear hing in metri spa es by spatial [1℄ R. Baeza-Yates and P. Larson. Performan e of B+ - approximation. In Pro . 6th International Sympo- trees with Partial Expansions. IEEE Transa tions on sium on String Pro essing and Information Retrieval Knowledge and Data Engineering, 1(2):248{257, 1989. (SPIRE'99), pages 141{148. IEEE CS Press, 1999. [2℄ J. Bentley. Multidimensional binary sear h trees in [17℄ G. Navarro. Sear hing in metri spa es by spatial database appli ations. IEEE Transa tions on Software approximation. Te hni al Report TR/DCC-2001-4, Engineering, 5(4):333{340, 1979. Dept. of Computer S ien e, Univ. of Chile, 2001. [3℄ W. Burkhard and R. Keller. Some approa hes to best- ftp://ftp.d .u hile. l/pub/users/gnavarro/ mat h le sear hing. Communi ations of the ACM, jsat.ps.gz. 16(4):230{236, 1973. [18℄ S. Nene and S. Nayar. A simple algorithm for nearest [4℄ T. Bozkaya and M. Ozsoyoglu. Distan e-based index- neighbor sear h in high dimensions. IEEE Transa - ing for high-dimensional metri spa es. In Pro . ACM tions on Pattern Analysis and Ma hine Intelligen e, Conferen e on Management of Data (SIGMOD'97), 19(9):989{1003, 1997. pages 357{368, 1997. Sigmod Re ord 26(2). [19℄ H. Nolteimer, K. Verbarg, and C. Zirkelba h. [5℄ S. Brin. Near neighbor sear h in large metri spa es. In Monotonous Bise tor Trees { a tool for eÆ ient par- Pro . of the 21st Conferen e on Very Large Databases titioning of omplex s henes of geometri obje ts. In (VLDB'95), pages 574{584, 1995. Data Stru tures and EÆ ient Algorithms, LNCS 594, [6℄ R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. pages 186{203, 1992. Proximity mat hing using xed-queries trees. In Pro . [20℄ N. Reyes. Dynami data stru tures for sear hing met- 5th Conferen e on Combinatorial Pattern Mat hing ri spa es. MS . Thesis, Univ. Na . de San Luis, Ar- (CPM'94), LNCS 807, pages 198{212, 1994. gentina, 2001. In progress. G. Navarro, advisor. [7℄ E. Chavez, J. Marroqun, and R. Baeza-Yates. [21℄ J. Uhlmann. Implementing metri trees to satisfy gen- Spaghettis: an array based algorithm for similarity eral proximity/similarity queries. Manus ript, 1991. queries in metri spa es. In Pro . 6th International [22℄ J. Uhlmann. Satisfying general proximity/similarity Symposium on String Pro essing and Information Re- queries with metri trees. Information Pro essing Let- trieval (SPIRE'99), pages 38{46. IEEE CS Press, ters, 40:175{179, 1991. Elsevier. 1999. [23℄ E. Vidal. An algorithm for nding nearest neighbors [8℄ E. Chavez, J. Marroqun, and G. Navarro. Fixed in (approximately) onstant average time. Pattern queries array: A fast and e onomi al data stru ture Re ognition Letters, 4:145{157, 1986. for proximity sear hing. Multimedia Tools and Appli- [24℄ P. Yianilos. Data stru tures and algorithms for near- ations, 14(2):113{135, 2001. Kluwer. est neighbor sear h in general metri spa es. In Pro . [9℄ E. Chavez and G. Navarro. An e e tive lustering al- 4th ACM-SIAM Symposium on Dis rete Algorithms gorithm to index high dimensional metri spa es. In (SODA'93), pages 311{321, 1993. Pro . 7th International Symposium on String Pro ess- [25℄ P. Yianilos. Lo ally lifting the urse of dimensionality for nearest neighbor sear h. In Pro . 11th ACM-SIAM ing and Information Retrieval (SPIRE'00), pages 75{ Symposium on Dis rete Algorithms (SODA'00), 2000. 86. IEEE CS Press, 2000. [10℄ E. Chavez, G. Navarro, R. Baeza-Yates, and J. Mar- roqun. Sear hing in metri spa es. ACM Computing Surveys, 2001. To appear. [11℄ P. Cia ia, M. Patella, and P. Zezula. M-tree: an eÆ ient a ess method for similarity sear h in metri spa es. In Pro . of the 23rd Conferen e on Very Large Databases (VLDB'97), pages 426{435, 1997. [12℄ F. Dehne and H. Nolteimer. Voronoi trees and lus- tering problems. Information Systems, 12(2):171{175, 1987. Pergamon Journals. [13℄ A. Guttman. R-trees: a dynami index stru ture for spatial sear hing. In Pro . ACM Conferen e on Man- agement of Data (SIGMOD'84), pages 47{57, 1984.