Dynamic Spatial Approximation Trees

Gonzalo Navarro                          Nora Reyes
Dept. of Computer Science                Depto. de Informatica
University of Chile                      Universidad Nacional de San Luis
Blanco Encalada 2120, Santiago, Chile    Ejercito de los Andes 950, San Luis, Argentina
gnavarro@dcc.uchile.cl                   nreyes@unsl.edu.ar
Abstract

The Spatial Approximation Tree (sa-tree) is a recently proposed data structure for searching in metric spaces. It has been shown that it compares favorably against alternative data structures in spaces of high dimension or queries with low selectivity. The main drawback of the sa-tree is that it is a static data structure, that is, once built, it is difficult to add new elements to it. This rules it out for many interesting applications.

In this paper we overcome this weakness. We propose and study several methods to handle insertions in the sa-tree. Some are classical folklore solutions well known in the data structures community, while the most promising ones have been specifically developed considering the particular properties of the sa-tree, and involve new algorithmic insights into the behavior of this data structure. As a result, we show that it is viable to modify the sa-tree so as to permit fast insertions while keeping its good search efficiency.

* Partially supported by Fondecyt grant 1-000929.

1. Introduction

The concept of "approximate" searching has applications in a vast number of fields. Some examples are non-traditional databases (e.g. storing images, fingerprints or audio clips, where the concept of exact search is of no use and we search instead for similar objects); text searching (to find words and phrases in a text database allowing a small number of typographical or spelling errors); information retrieval (to look for documents that are similar to a given query or document); machine learning and classification (to classify a new element according to its closest representative); image quantization and compression (where only some vectors can be represented and we code the others as their closest representable point); computational biology (to find a DNA or protein sequence in a database allowing some errors due to mutations); and function prediction (to search for the most similar behavior of a function in the past so as to predict its probable future behavior).

All those applications have some common characteristics. There is a universe U of objects, and a non-negative distance function d : U × U → R+ defined among them. This distance satisfies the three axioms that make the set a metric space: strict positiveness (d(x,y) = 0 ⟺ x = y), symmetry (d(x,y) = d(y,x)) and the triangle inequality (d(x,z) ≤ d(x,y) + d(y,z)). The smaller the distance between two objects, the more "similar" they are. We have a finite database S ⊆ U, which is a subset of the universe of objects and can be preprocessed (to build an index, for example). Later, given a new object from the universe (a query q), we must retrieve all similar elements found in the database. There are two typical queries of this kind:

Range query: Retrieve all elements within distance r to q in S. This is, {x ∈ S, d(x,q) ≤ r}.

Nearest neighbor query (k-NN): Retrieve the k closest elements to q in S. That is, a set A ⊆ S such that |A| = k and ∀x ∈ A, y ∈ S − A, d(x,q) ≤ d(y,q).

The distance is considered expensive to compute (think, for instance, of comparing two fingerprints). Hence, it is customary to define the complexity of the search as the number of distance evaluations performed, disregarding other components such as CPU time for side computations, and even I/O time. Given a database of |S| = n objects, queries can be trivially answered by performing n distance evaluations. The goal is to structure the database such that we perform fewer distance evaluations.
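The trivial n-evaluation baseline can be sketched in a few lines; a minimal illustration (the toy integer space and the names `range_query`/`knn_query` are ours, not the paper's):

```python
import heapq

def range_query(S, d, q, r):
    """Trivial range query: one distance evaluation per database element."""
    return [x for x in S if d(x, q) <= r]

def knn_query(S, d, q, k):
    """Trivial k-NN: the k elements with smallest distance to q."""
    return heapq.nsmallest(k, S, key=lambda x: d(x, q))

# Toy metric space: integers under absolute difference.
S = [1, 4, 9, 16, 25]
d = lambda u, v: abs(u - v)
print(range_query(S, d, q=10, r=3))   # -> [9]
print(knn_query(S, d, q=10, k=2))     # -> [9, 4]
```

Every indexing method discussed below tries to answer the same two queries while evaluating d far fewer than n times.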
A particular case of this problem arises when the space is a set of d-dimensional points and the distance belongs to the Minkowski Lp family: Lp = (Σ_{1≤i≤d} |x_i − y_i|^p)^{1/p}. The best known special cases are p = 1 (Manhattan distance), p = 2 (Euclidean distance) and p = ∞ (maximum distance), that is, L∞ = max_{1≤i≤d} |x_i − y_i|.

There are effective methods to search in d-dimensional spaces, such as kd-trees [2] or R-trees [13]. However, for roughly 20 dimensions or more those structures cease to work well. We focus in this paper on general metric spaces, although the solutions are well suited also for d-dimensional spaces.

It is interesting to notice that the concept of "dimensionality" can be translated to metric spaces as well: the typical feature in high dimensional spaces with Lp distances is that the probability distribution of distances among elements has a very concentrated histogram (with larger mean as the dimension grows), making the work of any similarity search algorithm more difficult [5, 10]. In the extreme case we have a space where d(x,x) = 0 and ∀y ≠ x, d(x,y) = 1, where it is impossible to avoid a single distance evaluation at search time. We say that a general metric space is high dimensional when its histogram of distances is concentrated.

There are a number of methods to preprocess the set in order to reduce the number of distance evaluations. All those structures work on the basis of discarding elements using the triangle inequality, and most use the classical divide-and-conquer approach (which is not specific to metric space searching).

The Spatial Approximation Tree (sa-tree) is a recently proposed data structure of this kind [16], which is based on a novel concept: rather than dividing the search space, approach the query spatially, that is, start at some point in the space and get closer and closer to the query. It has been shown that the sa-tree behaves better than the other existing structures on metric spaces of high dimension or queries with low selectivity, which is the case in many applications.

The sa-tree, unlike other data structures, does not have parameters to be tuned by the user of each application. This makes it very appealing as a general-purpose data structure for metric searching, since any non-expert seeking a tool to solve his/her particular problem can use it as a black-box tool, without the need of understanding the complications of an area he/she is not interested in. Other data structures have many tuning parameters, hence requiring a big effort from the user in order to obtain an acceptable performance.

On the other hand, the main weakness of the sa-tree is that it is not dynamic. That is, once it is built, it is difficult to add new elements to it. This makes the sa-tree unsuitable for dynamic applications such as multimedia databases.

Overcoming this weakness is the aim of this paper. We propose and study several methods to handle insertions in the sa-tree. Some are classical folklore solutions well known in the data structures community, while the most promising ones have been specifically developed considering the particular properties of the sa-tree. As a result, we show that it is viable to modify the sa-tree so as to permit fast insertions while keeping its good search efficiency. As a related byproduct of this study, we give new algorithmic insights into the behavior of this data structure.

2. Previous Work

Algorithms to search in general metric spaces can be divided into two large areas: pivot-based and clustering algorithms. (See [10] for a more complete review.)

Pivot-based algorithms. The idea is to use a set of k distinguished elements ("pivots") p1 ... pk ∈ S and to store, for each database element x, its distances to the k pivots (d(x,p1) ... d(x,pk)). Given the query q, its distance to the k pivots is computed (d(q,p1) ... d(q,pk)). Now, if for some pivot pi it holds that |d(q,pi) − d(x,pi)| > r, then we know by the triangle inequality that d(q,x) > r and therefore do not need to explicitly evaluate d(x,q). All the other elements that cannot be eliminated using this rule are directly compared against the query.

Several algorithms [23, 15, 7, 18, 6, 8] are almost direct implementations of this idea, and differ basically in the extra structure used to reduce the CPU cost of finding the candidate points, but not in their number of distance evaluations.

There are a number of tree-like data structures that use this idea in a more indirect way: they select a pivot as the root of the tree and divide the space according to the distances to the root. One slice corresponds to each subtree (the number and width of the slices differ across the strategies). At each subtree, a new pivot is selected and so on. The search backtracks on the tree using the triangle inequality to prune subtrees, that is, if a is the tree root and b is a child corresponding to d(a,b) ∈ [x1, x2], then we can avoid entering the subtree of b whenever [d(q,a) − r, d(q,a) + r] has no intersection with [x1, x2].

Several data structures use this idea [3, 22, 14, 24, 4, 25].
Clustering algorithms. The second trend consists in dividing the space into zones as compact as possible, normally recursively, and storing a representative point ("center") for each zone plus a few extra data that permit quickly discarding the zone at query time. Two criteria can be used to delimit a zone.

The first one is the Voronoi area, where we select a set of centers and put each other point inside the zone of its closest center. The areas are limited by hyperplanes and the zones are analogous to Voronoi regions in vector spaces. Let {c1 ... cm} be the set of centers. At query time we evaluate (d(q,c1), ..., d(q,cm)), choose the closest center c and discard every zone whose center ci satisfies d(q,ci) > d(q,c) + 2r, as its Voronoi area cannot intersect the query ball.

The second criterion is the covering radius cr(ci), which is the maximum distance between ci and an element in its zone. If d(q,ci) − r > cr(ci), then there is no need to consider zone i.

The techniques can be combined. Some techniques using only hyperplanes are [22, 19, 12]. Some techniques using only covering radii are [11, 9]. One uses both criteria [5].

Nearest neighbor queries. To answer 1-NN queries, we simulate a range query with a radius that is initially r = ∞, and reduce r as we find closer and closer elements to q. At the end, we have in r the distance to the closest elements and have seen them all. Unlike a range query, we are now interested in quickly finding close elements in order to reduce r as early as possible, so there are a number of heuristics to achieve this. One of the most interesting is proposed in [21], where the subtrees yet to be processed are stored in a priority queue in a heuristically promising ordering. The traversal is more general than a backtracking. Each time we process the root of the most promising subtree, we may add its children to the priority queue. At some point we can preempt the search using a cutoff criterion given by the triangle inequality.

k-NN queries are handled as a generalization of 1-NN queries. Instead of one closest element, the k closest elements known are maintained, and r is the distance from q to the farthest among those k. Each time a new candidate appears we insert it into the queue, which may displace another element and hence reduce r. At the end, the queue contains the k closest elements to q.

3. The Spatial Approximation Tree

We describe briefly in this section the sa-tree data structure. It needs linear space O(n), reasonable construction time O(n log² n / log log n) and sublinear search time O(n^(1−Θ(1/log log n))) in high dimensions and O(n^α) (0 < α < 1) in low dimensions. It is experimentally shown to improve over other data structures when the dimension is high or the query radius is large. For more details see the original references [16, 17].

3.1. Construction

We select a random element a ∈ S to be the root of the tree. We then select a suitable set of neighbors N(a) satisfying the following property:

Condition 1: (given a, S) ∀x ∈ S, x ∈ N(a) ⟺ ∀y ∈ N(a) − {x}, d(x,y) > d(x,a).

That is, the neighbors of a form a set such that any neighbor is closer to a than to any other neighbor. The "only if" (⇐) part of the definition guarantees that if we can get closer to some b ∈ S then an element in N(a) is closer to b than a, because we put as direct neighbors all those elements that are not closer to another neighbor. The "if" part (⇒) aims at putting as few neighbors as possible.

Notice that the set N(a) is defined in terms of itself in a non-trivial way and that multiple solutions fit the definition. For example, if a is far from b and c and these are close to each other, then both N(a) = {b} and N(a) = {c} satisfy the definition.

Finding the smallest possible set N(a) seems to be a nontrivial combinatorial optimization problem, since by including an element we need to take out others (this happens between b and c in the example of the previous paragraph). However, simple heuristics which add more neighbors than necessary work well. We begin with the initial node a and its "bag" holding all the rest of S. We first sort the bag by distance to a.

Then, we start adding nodes to N(a) (which is initially empty). Each time we consider a new node b, we check whether it is closer to some element of N(a) than to a itself. If that is not the case, we add b to N(a).

At this point we have a suitable set of neighbors. Note that Condition 1 is satisfied thanks to the fact that we have considered the elements in order of increasing distance to a. The "only if" part of Condition 1 is clearly satisfied because any element not satisfying it is inserted in N(a). The "if" part is more delicate. Let x ≠ y ∈ N(a). If y is closer to a than x, then y was considered first. Our construction algorithm guarantees that if we inserted x in N(a) then d(x,a) < d(x,y). If, on the other hand, x is closer to a than y, then d(y,x) > d(y,a) ≥ d(x,a) (that is, a neighbor cannot be removed by a new neighbor inserted later).
We now must decide in which neighbor's bag we put the rest of the nodes. We put each node not in {a} ∪ N(a) in the bag of its closest element of N(a) (best-fit strategy). Observe that this requires a second pass once N(a) is fully determined.

We are done now with a, and process recursively all its neighbors, each one with the elements of its bag. Note that the resulting structure is a tree that can be searched for any q ∈ S by spatial approximation for nearest neighbor queries. The reason why this works is that, at search time, we repeat exactly what happened with q during the construction process (i.e. we enter the subtree of the neighbor closest to q), until we reach q. This is because q is present in the tree, i.e., we are doing an exact search after all.

Finally, we save some comparisons at search time by storing at each node a its covering radius, i.e. the maximum distance R(a) between a and any element in the subtree rooted at a. The way to use this information is made clear in Section 3.2.

Figure 1 depicts the construction process. It is firstly invoked as BuildTree(a, S − {a}), where a is a random element of S. Note that, except for the first level of the recursion, we already know all the distances d(v,a) for every v ∈ S and hence do not need to recompute them. Similarly, d(v,c) at line 10 is already known from line 6. The information stored by the data structure is the root a and the N() and R() values of all the nodes.

BuildTree (Node a, Set of nodes S)
 1.  N(a) ← ∅                                /* neighbors of a */
 2.  R(a) ← 0                                /* covering radius */
 3.  Sort S by distance to a (closer first)
 4.  for v ∈ S do
 5.      R(a) ← max(R(a), d(v,a))
 6.      if ∀b ∈ N(a), d(v,a) < d(v,b)
 7.          then N(a) ← N(a) ∪ {v}
 8.  for b ∈ N(a) do S(b) ← ∅
 9.  for v ∈ S − N(a) do
10.      Let c ∈ N(a) be the one minimizing d(v,c)
11.      S(c) ← S(c) ∪ {v}
12.  for b ∈ N(a) do BuildTree (b, S(b))

Figure 1. Algorithm to build the sa-tree.

3.2. Searching

Of course it is of little interest to search only for elements q ∈ S. The tree we have described can, however, be used as a device to solve queries of any type for any q ∈ U. We start with range queries with radius r.

The key observation is that, even if q ∉ S, the answers to the query are elements q′ ∈ S. So we use the tree to pretend that we are searching for an element q′ ∈ S. We do not know q′, but since d(q,q′) ≤ r, we can obtain from q some distance information regarding q′: by the triangle inequality it holds that for any x ∈ U, d(x,q) − r ≤ d(x,q′) ≤ d(x,q) + r.

Hence, instead of simply going to the closest neighbor, we first determine the closest neighbor c of q among {a} ∪ N(a). We then enter all neighbors b ∈ N(a) such that d(q,b) ≤ d(q,c) + 2r. This is because the virtual element q′ sought can differ from q by at most r at any distance evaluation, so it could have been inserted inside any of those b nodes. In the process, we report all the nodes q′ we found close enough to q.

Moreover, notice that, in an exact search for a q′ ∈ S, the distances between q′ and the nodes we traverse get reduced as we step down the tree. That is,

Observation 1: Let a, b, c ∈ S such that b descends from a and c from b in the tree. Then d(c,b) ≤ d(c,a).

The same happens, allowing a tolerance of 2r, in a range search with radius r. That is, for any b in the path from a to q′ it holds d(q′,b) ≤ d(q′,a), so d(q,b) ≤ d(q,a) + 2r. Hence, while at first we need to enter all the neighbors b ∈ N(a) such that d(q,b) ≤ d(q,c) + 2r, when we enter those b the tolerance is not 2r anymore but gets reduced to 2r − (d(q,b) − d(q,c)).

The covering radius R(a) is used to further prune the search, by not entering subtrees such that d(q,a) > R(a) + r, since they cannot contain useful elements.

Figure 2 illustrates the search process, starting from the tree root p11. Only p9 is in the result, but all the bold edges are traversed. Figure 3 gives the search algorithm, initially invoked as RangeSearch(a, q, r, 2r), where a is the tree root. Note that in the recursive invocations d(a,q) is already computed.

Nearest neighbor searching. We can also perform nearest neighbor searching by simulating a range search where the search radius is reduced, just as explained at the end of Section 2. We have a priority queue of subtrees, most promising first. Initially, we insert the sa-tree root in the data structure. Iteratively, we extract the most promising subtree, process its root, and insert all its subtrees in the queue. This is repeated until the queue gets empty or its most promising subtree can be discarded (i.e., its promise value is bad enough). For lack of space we omit further details.
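A possible Python transliteration of Figure 1 (the dictionaries `N` and `R` standing for the stored N() and R() values, and the one-dimensional toy set, are our representation choices):

```python
def build_tree(a, S, d, N, R):
    """Figure 1: pick N(a) from the bag sorted by distance to a,
    then distribute the remaining elements best-fit and recurse."""
    N[a], R[a] = [], 0
    S = sorted(S, key=lambda v: d(v, a))            # closer first
    for v in S:
        R[a] = max(R[a], d(v, a))                   # covering radius R(a)
        if all(d(v, a) < d(v, b) for b in N[a]):
            N[a].append(v)                          # v becomes a neighbor
    bag = {b: [] for b in N[a]}
    for v in S:                                     # second pass: best fit
        if v not in N[a]:
            bag[min(N[a], key=lambda b: d(v, b))].append(v)
    for b in N[a]:
        build_tree(b, bag[b], d, N, R)

# One-dimensional toy set, root 0, d = absolute difference.
N, R = {}, {}
build_tree(0, [1, 2, 10], lambda u, v: abs(u - v), N, R)
```

On this toy set only 1 becomes a neighbor of the root (2 and 10 are closer to 1 than to 0), so both fall into 1's bag, reproducing the chain structure the recursion then refines.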
[Plot: an sa-tree over points p1 ... p15, rooted at p11; the edges traversed by the example query q are drawn in bold.]

Figure 2. An example of the search process.

RangeSearch (Node a, Query q, Radius r, Tolerance t)
    if d(a,q) ≤ r then Report a
    if d(a,q) ≤ R(a) + r then
        dmin ← min {d(c,q), c ∈ {a} ∪ N(a)}
        for b ∈ N(a) do
            if d(b,q) − dmin ≤ t then
                RangeSearch (b, q, r, t − (d(b,q) − dmin))

Figure 3. Searching q with radius r in a sa-tree.

4. Incremental Construction

The sa-tree is a structure whose construction algorithm needs to know all the elements of S in advance. In particular, it is difficult to add new elements under the best-fit strategy once the tree is already built. Each time a new element is inserted, we must go down the tree by the closest neighbor until the new element must become a neighbor of the current node a. The whole subtree rooted at a must then be rebuilt from scratch, since some nodes that went into another neighbor could now prefer to get into the new neighbor.

In this section we discuss and empirically evaluate different alternatives to permit insertion of new elements into an already built sa-tree. For the experiments we have selected two metric spaces. The first is a dictionary of 69,069 English words, from where we randomly chose queries. The distance in this case is the edit distance, that is, the minimum number of character insertions, deletions and replacements to make the strings equal. The second space is the real unitary cube in dimension 15 using Euclidean distance. We generated 100,000 random points with uniform distribution. For the queries, we build the indexes with 90% of the points and use the other 10% (randomly chosen) as queries. The results on these two spaces are representative of those on many other metric spaces we tested: NASA images, dictionaries in other languages, Gaussian distributions, other dimensions, etc.

As a comparison point for what follows, a static construction costs about 5 million comparisons for the dictionary and 12.5 million for the vector space.

4.1. Rebuilding the Subtree

The naive approach rebuilds the whole subtree rooted at a once a new element x being inserted has to become a new neighbor of a. This has the advantage of preserving the same tree that is built statically, but, as Figure 4 shows for the case of the dictionary, the dynamic construction becomes too costly in comparison to a static one (140 times more costly in this example, almost 230 times more in our vector space).

[Plot: distance evaluations (×10,000) vs. percentage of database used, static vs. dynamic construction, n = 69,069 words.]

Figure 4. Construction cost by rebuilding subtrees.

4.2. Overflow Buckets

We can have an overflow bucket per node with "extra" neighbors that should go in the subtree but have not been classified yet. When the new element x must become a neighbor of a, we put it in the overflow bucket of a. Each time we reach a at query time, we also compare q against its overflow bucket and report any element near enough.

We must limit the size of the overflow buckets in order to maintain a reasonable search efficiency. We rebuild a subtree when its overflow bucket exceeds a given size. The main question is what the tradeoff is in practice between reconstruction cost and query cost. As smaller overflow buckets are permitted, we rebuild the tree more often and improve the query time, but the construction time rises.
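The overflow-bucket scheme of Section 4.2 can be sketched as follows; `insert`, `bucket` and the `rebuild_subtree` callback are our names, and the rebuild itself (reinserting a's subtree plus its bucket) is left abstract:

```python
MAX_BUCKET = 500   # bucket-size knob studied in Figures 5 and 6

def insert(a, x, d, N, bucket, rebuild_subtree):
    """Descend to the closest neighbor while some neighbor is closer to x
    than a; when x ought to become a neighbor of a, park it in a's
    overflow bucket and rebuild only when the bucket overflows."""
    while True:
        closer = [b for b in N[a] if d(x, b) < d(x, a)]
        if not closer:
            bucket[a].append(x)            # also scanned at query time
            if len(bucket[a]) > MAX_BUCKET:
                rebuild_subtree(a)         # reinsert a's subtree and bucket
            return
        a = min(closer, key=lambda b: d(x, b))

# Tiny demo: root 0 with single neighbor 7, d = absolute difference.
N = {0: [7], 7: []}
bucket = {0: [], 7: []}
insert(0, 8, lambda u, v: abs(u - v), N, bucket, rebuild_subtree=lambda a: None)
```

In the demo, 8 is closer to the neighbor 7 than to the root, so it descends and is parked in 7's bucket rather than forcing a rebuild.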
Construction cost for n = 69,069 Query cost per element for n = 69,069 words
800 60000
Distance evaluations (x 10,000)
700 50000
Distance evaluations
600
40000
500
30000
400
20000
300
10000 Bucket size = 250
200 Bucket size = 500
Bucket size = 1000
100 0
0 100 200 300 400 500 600 700 800 900 1000 1 2 3 4
Size of overflow bucket Search radius
Construction cost for n = 100,000 vectors dim. 15 Query cost per element for n = 100,000 vectors dim. 15
1800 90000
Distance evaluations (x 10,000)
1600 85000
Distance evaluations
1400 80000
1200 75000
1000 70000
800 65000
Bucket size = 250
600 60000 Bucket size = 500
Bucket size = 1000
400 55000
0 100 200 300 400 500 600 700 800 900 1000 0.01 0.1 1
Size of overflow bucket Percentage of database retrieved
Figure 5. Constru
tion
osts using over
ow bu
kets. Figure 6. Sear
h
osts using over
ow bu
kets.
stati
onstru
tion provided the
orre
t bu
ket size is
Figure 5 shows the
ost of the
onstru
tion using
hosen. For example, with bu
ket size 500 we obtain
dierent bu
ket sizes, whi
h exhibits interesting
u
- almost the same sear
h
osts as for the stati
version, at
tuations and in some
ases
osts even less than a stati
the modest pri
e of 10% extra
onstru
tion
ost for the
onstru
tion. This is possible be
ause many un
lassi- di
tionary and 30% for the ve
tors. The main problem
ed elements are left in the bu
kets. For example, for in this method is its high sensitivity to the
u
tuations,
bu
ket size 1,000, almost all the elements are in over- whi
h makes it diÆ
ult to sele
t a good bu
ket size.
ow bu
kets in the di
tionary
ase and almost 60% The intermediate bu
ket size 500 works well be
ause
in the ve
tor
ase. These
u
tuations appear be
ause at this point the elements in over
ow bu
kets are 30%
a larger bu
ket size may produ
e more rebuilds than in the di
tionary and 15% in the ve
tors.
a smaller one for a given set size n. The ee
t is well
known, for example it appears when studying the num- 4.3. A First-Fit Strategy
ber of splits as a fun
tion of the B-tree page size [1℄.
Figure 6 shows the sear
h
osts using over
ow bu
k- Yet another solution is to
hange our best-t strategy
ets. We sear
hed with xed radius 1 to 4 in the di
tio- to put elements inside the bags of the neighbors of a
nary example and with radii retrieving 0.01%, 0.1% at
onstru
tion time. An alternative, rst-t, is to put
and 1% of the set in the ve
tor example. We also ea
h node in the bag of the rst neighbor
loser than
performed nearest neighbor sear
h experiments, whi
h a to q. Determining N (a) and the bag of ea
h other
yielded similar results and are omitted for la
k of spa
e. element
an now be done all in one pass.
As
an be seen by
omparing the results to those With the rst-t strategy, however, we
an easily
of Figure 8, this te
hnique is
ompetitive against the add more elements by pretending that the new in
om-
Construction cost for n = 69,069 words Query cost per element for n = 69,069 words
1600 50000
Static
Distance evaluations (x 10,000)
1400 First-Fit 45000
Timestamp (up) 40000
Distance evaluations
1200 Timestamp (down)
35000
1000 30000
800 25000
600 20000
15000
400 Best-Fit
10000 First-Fit
200 5000 Timestamp (up)
Timestamp (down)
0 0
10 20 30 40 50 60 70 80 90 100 1 2 3 4
Percentage of database used Search radius
Construction cost for n = 100,000 vectors dim. 15 Query cost per element for n = 100,000 vectors dim. 15
1400 90000
Static
Distance evaluations (x 10,000)
1200 First-Fit 85000
Timestamp (down)
Distance evaluations
Timestamp (up) 80000
1000
75000
800
70000
600
65000
400
60000
Best-Fit
200 55000 First-Fit
Timestamp
0 50000
10 20 30 40 50 60 70 80 90 100 0.01 0.1 1
Percentage of database used Percentage of database retrieved
Figure 7. Constru
tion
osts using rst-t and using Figure 8. Sear
h
osts using rst-t and the two
timestamps. versions of the timestamping te
hnique.
ing element x was the last one in the bag, whi
h means Figure 8 shows sear
h times. As
an be seen, the
that when it be
omes a neighbor of a it
an be simply sear
h overhead of the rst-t strategy is too high, at a
added as the last neighbor of a, and there were no later point that makes the stru
ture not
ompetitive against
elements that had the
han
e of getting into x. This other existing ones.
allows building the stru
ture by su
essive insertions.
Figure 7 shows that the
onstru
tion (stati
or dy- 4.4. Timestamping
nami
) using rst-t is mu
h
heaper than using best-
t. Moreover, rst-t
osts exa
tly the same and pro- An alternative that has resemblan
es with the two
du
es the same tree in the stati
or the dynami
ase. previous but is more sophisti
ated
onsists in keeping a
Range sear
hing under the rst-t strategy is a lit- timestamp of the insertion time of ea
h element. When
tle dierent. We
onsider the neighbors fv1 ; : : : ; vk g inserting a new element, we add it as a neighbor at the
of a in order. We perform the minimization while appropriate point using best-t and do not rebuild the
we traverse the neighbors. That is, we enter into tree. Let us
onsider that neighbors are added at the
the subtree of v1 if d(q; v1 ) d(q; a) + 2r; into the end, so by reading them left to right we have in
reasing
subtree of v2 if d(q; v2 ) min(d(q; a); d(q; v1 )) + 2r; insertion times. It also holds that the parent is always
and in general into the subtree of vi if d(q; vi ) older than its
hildren.
min(d(q; a); d(q; v1 ); : : : ; d(q; vi 1 )) + 2r. This is be- As seen in Figure 7, this alternative
an
ost a bit
ause vi+j
an never take out an element from vi . more or a bit less than stati
best-t depending on the
ase. Two versions of this methods, labeled \up" and namism. In the
ase of the di
tionary, the timestamp-
\down" in the plot,
orrespond to how to handle the ing te
hnique is signi
antly worse than the stati
one
ase of equal distan
es to the root and to the
losest (although the \up" behaves slightly better for nearest
neighbor when inserting a new element. The former neighbor sear
hing). The problem is that the \up" ver-
inserts the element as a new neighbor and the latter sion is mu
h more
ostly to build, needing more than
sends it to the subtree of the
losest neighbor. This 3 times the stati
onstru
tion
ost.
makes a dieren
e only in dis
rete distan
es.
At sear
h time, we
onsider the neighbors
fv1 ; : : : ; vk g of a from oldest to newest. We perform 4.5. Inserting at the Fringe
the minimization while we traverse the neighbors, ex-
a
tly as in Se
tion 4.3. This is be
ause between the
insertion of vi and vi+j there may have appeared new Yet another alternative is as follows. We
an relax
elements that preferred vi just be
ause vi+j was not Condition 1 (Se
tion 3.1), whose main goal is to guar-
yet a neighbor, so we may miss an element if we do not antee that if q is
loser to a than to any neighbor in
enter into vi be
ause of the existen
e of vi+j . N (a) then we
an stop the sear
h at that point. The
Note that, although the sear
h pro
ess is the same as idea is that, at sear
h time, instead of nding the
los-
under rst-t, the insertion puts the elements into their est
among fag[ N (a) and entering into any b 2 N (a)
losest neighbor, so the stru
ture is more balan
ed. su
h that d(q; b) d(q;
) + 2r, we ex
lude the sub-
Up to now we do not really need timestamps but tree root fag from the minimization. Hen
e, we always
just to keep the neighbors sorted. Yet a more so-
ontinue to the leaves by the
losest neighbor and oth-
phisti
ated s
heme is to use the timestamps to re- ers
lose enough. This seems to make the sear
h time
du
e the work done inside older neighbors. Say that slightly worse, but the
ost is marginal.
d(q; vi ) > d(q; vi+j ) + 2r . We have to enter into vi The benet is that we are not for
ed anymore to put
be
ause it is older. However, only the elements with a new inserted element x as a neighbor of a, even when
timestamp smaller than that of vi+j should be
onsid- Condition 1 would require it. That is, at insertion time,
ered when sear
hing inside vi ; younger elements have even if x is
loser to a than to any element in N (a), we
seen vi+j and they
annot be interesting for the sear
h have the
hoi
e of not putting it as a neighbor of a but
if they are inside v_i. As parent nodes are older than their descendants, as soon as we find a node inside the subtree of v_i with timestamp larger than that of v_{i+j}, we can stop the search in that branch, because its subtree is even younger.

An alternative view, equivalent to the previous one but focusing on the maximum allowed radius instead of the maximum allowed timestamp, is as follows. Each time we enter a subtree y of v_i, we search for the siblings v_{i+j} of v_i that are older than y. Over this set, we compute the maximum radius that permits avoiding the processing of y, namely r_y = max(d(q, v_i) - d(q, v_{i+j})) / 2. If it holds that r < r_y, we do not need to enter the subtree y.

Let us now consider nearest neighbor searching. Assume that we are currently processing node v_i and insert its children y into the priority queue. We compute r_y as before and insert it together with y into the priority queue. Later, when the time comes to process y, we consider our current search radius r and discard y if r < r_y. If we insert a child z of y, we assign it the value min(r_y, r_z).

Figure 8 compares this technique against the static one. As can be seen, this is an excellent alternative to the static construction in the case of our vector space example, providing basically the same construction and search cost with the added value of dynamism.

... inserting it into its closest neighbor of N(a). At search time we will reach x, because the search and insertion processes are similar.

This freedom opens a number of new possibilities that deserve a much deeper study, but an immediate consequence is that we can always insert at the leaves of the tree. Hence, the tree is read-only in its top part and it changes only in the fringe.

However, we have to permit the reconstruction of small subtrees so as to avoid the tree becoming almost a linked list. So we permit inserting x as a neighbor when the size of the subtree to rebuild is small enough, which leads to a tradeoff between insertion cost and quality of the tree at search time.

Figure 9 shows the construction cost for different maximum tree sizes that can be rebuilt. As can be seen, permitting a tree size of 50 yields the same construction cost as the static version.

Finally, Figure 10 shows the search times using this technique. As can be seen, using a tree size of 50 permits the same and even better search times compared to the static version, which shows that it may be beneficial to move elements downward in the tree. This fact makes this alternative a very interesting choice deserving more study.
[Four plots omitted. Top row (n = 69,069 words): construction cost vs. tree size allowed to reconstruct, and query cost per element vs. search radius (1-4). Bottom row (n = 100,000 vectors, dim. 15): construction cost vs. tree size allowed to reconstruct, and query cost per element vs. percentage of database retrieved (0.01-1%). Y-axes measure distance evaluations; curves correspond to maximum rebuilt subtree sizes 10, 50, 100, 500, and 1000, plus Best-Fit.]

Figure 9. Construction costs inserting at the fringe.

Figure 10. Search costs using insertion in the fringe.
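The rebuild-when-small policy behind Figures 9 and 10 (insertions sink to the fringe, and only subtrees below a size threshold may be reconstructed) can be illustrated on a much simpler structure. The toy sketch below applies it to a binary search tree rather than an sa-tree, so it is an analogy, not the paper's algorithm; `MAX_REBUILD` plays the role of the maximum tree size allowed to reconstruct, and all names are illustrative.

```python
MAX_REBUILD = 4  # hypothetical threshold: largest subtree we rebuild

class BST:
    """Toy binary search tree node with a subtree-size counter."""
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def elements(t):
    """In-order (sorted) list of the keys in subtree t."""
    return [] if t is None else elements(t.left) + [t.key] + elements(t.right)

def build_balanced(keys):
    """Static, 'high quality' construction from sorted keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    t = BST(keys[mid])
    t.left = build_balanced(keys[:mid])
    t.right = build_balanced(keys[mid + 1:])
    t.size = len(keys)
    return t

def insert(t, x):
    """Insert x: rebuild the subtree it lands in only while that
    subtree is still small; otherwise push x further down."""
    if t is None:
        return BST(x)
    if t.size < MAX_REBUILD:            # small subtree: reconstruct it
        return build_balanced(sorted(elements(t) + [x]))
    t.size += 1                         # large subtree: read-only top,
    if x < t.key:                       # descend toward the fringe
        t.left = insert(t.left, x)
    else:
        t.right = insert(t.right, x)
    return t
```

Raising the threshold buys tree quality at higher insertion cost, which is exactly the tradeoff the figures explore for the sa-tree.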
5. Conclusions

We have presented several techniques to modify the sa-tree in order to make it a dynamic data structure supporting insertions, without degrading its current performance. We have shown that there are many more alternatives than what appears at first glance, and that the invariants of the sa-tree can be relaxed in ways unforeseen before this study (e.g., the fact that we can decide whether or not to add neighbors).

From the choices we have considered, the use of overflow buckets shows that it is possible to obtain construction and search times similar to those of the static version, although the choice of the bucket size deserves more study. Timestamping has also shown itself competitive in some metric spaces and not so attractive in others, a fact deserving more study. Finally, inserting at the fringe has shown the potential of even improving the performance of the static version, although studying the effect of the size of the fringe is required.

Other alternatives, such as rebuilding and first-fit, proved not to be competitive, although the latter offers very low construction costs, which could be interesting despite its much higher search cost.

It is clear now that making the sa-tree dynamic is affordable and that the structure can even be improved in a dynamic setup, contrary to our previous assumption that there would be a cost for the dynamism. On the other hand, we need to pursue the most promising alternatives further in order to understand them better. Moreover, we have not considered deletions yet. These seem more difficult, but they can always be handled by marking the nodes as deleted and making periodic rebuilds.

This work is a first step of a broader project [20] which aims at a fully dynamic data structure for searching in metric spaces, one that can also work on secondary memory. We have not touched this last aspect in this paper. A simple solution to store the sa-tree in secondary storage is to try to store whole subtrees in disk pages so as to minimize the number of pages read at search time. This has an interesting relationship with inserting at the fringe (Section 4.5), not only because the top part of the tree is read-only, but also because we can control the maximum arity of the tree so as to make the neighbors fit in a disk page.

References

[1] R. Baeza-Yates and P. Larson. Performance of B+-trees with partial expansions. IEEE Transactions on Knowledge and Data Engineering, 1(2):248-257, 1989.

[2] J. Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, 5(4):333-340, 1979.

[3] W. Burkhard and R. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4):230-236, 1973.

[4] T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proc. ACM Conference on Management of Data (SIGMOD'97), pages 357-368, 1997. Sigmod Record 26(2).

[5] S. Brin. Near neighbor search in large metric spaces. In Proc. 21st Conference on Very Large Databases (VLDB'95), pages 574-584, 1995.

[6] R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries trees. In Proc. 5th Conference on Combinatorial Pattern Matching (CPM'94), LNCS 807, pages 198-212, 1994.

[7] E. Chavez, J. Marroquín, and R. Baeza-Yates. Spaghettis: an array based algorithm for similarity queries in metric spaces. In Proc. 6th International Symposium on String Processing and Information Retrieval (SPIRE'99), pages 38-46. IEEE CS Press, 1999.

[8] E. Chavez, J. Marroquín, and G. Navarro. Fixed queries array: a fast and economical data structure for proximity searching. Multimedia Tools and Applications, 14(2):113-135, 2001. Kluwer.

[9] E. Chavez and G. Navarro. An effective clustering algorithm to index high dimensional metric spaces. In Proc. 7th International Symposium on String Processing and Information Retrieval (SPIRE'00), pages 75-86. IEEE CS Press, 2000.

[10] E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.

[11] P. Ciaccia, M. Patella, and P. Zezula. M-tree: an efficient access method for similarity search in metric spaces. In Proc. 23rd Conference on Very Large Databases (VLDB'97), pages 426-435, 1997.

[12] F. Dehne and H. Nolteimer. Voronoi trees and clustering problems. Information Systems, 12(2):171-175, 1987. Pergamon Journals.

[13] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proc. ACM Conference on Management of Data (SIGMOD'84), pages 47-57, 1984.

[14] L. Micó, J. Oncina, and R. Carrasco. A fast branch and bound nearest neighbor classifier in metric spaces. Pattern Recognition Letters, 17:731-739, 1996. Elsevier.

[15] L. Micó, J. Oncina, and E. Vidal. A new version of the nearest-neighbor approximating and eliminating search (AESA) with linear preprocessing-time and memory requirements. Pattern Recognition Letters, 15:9-17, 1994. Elsevier.

[16] G. Navarro. Searching in metric spaces by spatial approximation. In Proc. 6th International Symposium on String Processing and Information Retrieval (SPIRE'99), pages 141-148. IEEE CS Press, 1999.

[17] G. Navarro. Searching in metric spaces by spatial approximation. Technical Report TR/DCC-2001-4, Dept. of Computer Science, Univ. of Chile, 2001. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/jsat.ps.gz.

[18] S. Nene and S. Nayar. A simple algorithm for nearest neighbor search in high dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9):989-1003, 1997.

[19] H. Nolteimer, K. Verbarg, and C. Zirkelbach. Monotonous Bisector Trees - a tool for efficient partitioning of complex scenes of geometric objects. In Data Structures and Efficient Algorithms, LNCS 594, pages 186-203, 1992.

[20] N. Reyes. Dynamic data structures for searching metric spaces. MSc. Thesis, Univ. Nac. de San Luis, Argentina, 2001. In progress. G. Navarro, advisor.

[21] J. Uhlmann. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript, 1991.

[22] J. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175-179, 1991. Elsevier.

[23] E. Vidal. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Letters, 4:145-157, 1986.

[24] P. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM-SIAM Symposium on Discrete Algorithms (SODA'93), pages 311-321, 1993.

[25] P. Yianilos. Locally lifting the curse of dimensionality for nearest neighbor search. In Proc. 11th ACM-SIAM Symposium on Discrete Algorithms (SODA'00), 2000.