A Hierarchical Semantic Overlay Approach to P2P Similarity Search
Duc A. Tran
Computer Science Department
University of Dayton
Dayton, OH 45469
[email protected]
1
Introduction
One of the most important problems in information retrieval is similarity search. Informally, the problem is:
given a similarity query, which can be a point query or
a range query, we need to return a set of contents that
are most relevant to the search criteria according to some
semantic distance function. We propose EZSearch, a
decentralized solution to this problem in the context of
Peer-to-Peer (P2P) networks. EZSearch features the following for a network of N users. First, queries can be answered with O(logk N ) worst-case search time and low
search overhead. Second, to maintain the hierarchy, a
node keeps track of only O(k) other nodes and failure recovery requires no more than O(k) reconnections; these
overheads are independent of the network size. Last but
not least, the number of objects whose indices are stored
at remote nodes is small and, therefore, so are the costs
of index migration, storage, and validity.
2
Peer-to-Peer Similarity Search
We consider a P2P network where each node has a set
of data objects to share with other nodes in the network. These data objects are described based on the vector space model used in information retrieval theory [1].
Each data object x is represented as a d-term semantic
vector Tx = (w1x , w2x , .., wdx ), where each dimension
ti reflects the keyword, concept, or term associated with
x and wix the weight to reflect the significance of ti in
representing the semantic of x. Without loss of generality, we assume that all the weight values belong to the
interval [0, 1].
We employ the commonly used Cosine distance function to measure the semantic similarity between two obT ·T
jects x and y: simdist(x, y) = cos−1 kTx kx2 kTyy k2 where
Tx · Ty is the dot product between vectors Tx and Ty
and k.k2 is the Euclidean vector norm. The smaller
simdist(x, y) is, the more semantically similar are X
and Y to each other.
We consider two types of queries: point queries and
range queries. A point query is described by a term vector Q = (w1Q , w2Q , .., wdQ ). We expect to return those
data objects x such that simdist(x, Q) is minimum. In
some applications, the user may also specify a small constant ǫ to find those objects such that simdist(x, Q) < ǫ.
There are two types of range query, namely simple and
composite. A simple range query is described by a hyperrectangular region Q = [min1Q , max1Q ] × [min2Q ,
max2Q ] × .. × [mindQ , maxdQ ]. A composite range
query is a set of simple range queries. For a range query
Q, we expect to return those data objects that belong to
the region Q. While the system is aimed to be fully decentralized, we assume that a new user knows at least one
existing user before the former can join the network.
3
Proposed Solution: EZSearch
The basic idea behind EZSearch is as follows. Peers (i.e.,
user nodes) are partitioned into clusters. Each cluster
contains nodes having similar contents and manages a
subspace of indices (peer location P , term vector Tx ),
or an index zone. For a search, the simplest solution is
to scan all the clusters, which, however, would incur a
linear search time. Alternatively, similar to using search
trees for logarithmic runtime search, we can build a decision hierarchical overlay on top of these clusters such
that the search scope will be reduced by some factor if
the query is forwarded from a layer of the hierarchy to a
lower layer.
For building the cluster overlay, we propose to use
the Zigzag hierarchy, which we originally devised for
streaming multimedia [2, 3]. The main advantage of
Zigzag is its capability to handle the dynamics of P2P
networks. We first present Zigzag and then propose how
similarity search can be fulfilled efficiently using this hierarchy.
3.1
Zigzag Hierarchy
head
Definition 1 [Zigzag hierarchy] A Zigzag-k hierarchy of
N nodes is a multi-layer hierarchy of clusters recursively
defined below: (k > 3 is a constant):
17
1
1. Layer 0 contains all peers.
2
5
2
5
3 4
2. If the number of peers at layer j is greater than 3k,
they are partitioned into clusters whose size is in [k,
3k]. Otherwise, we reach the highest layer, where
peers form only a single cluster. The size of this
highest-layer cluster is in [2, 3k].
3. A layer-j cluster designates two member peers as its
head and associate-head. The head automatically
appears at layer (j + 1). The cluster partition at
layer (j + 1) is the same as at layer j. An exception
applies to the highest-layer cluster in which only the
head role is needed but the associate-head role is
not necessary.
An illustration is given in the top diagram of Figure 1,
where 52 nodes are organized into a Zigzag-4 hierarchy.
Hereafter, we denote by head(.) and ahead(.) the head
and associate-head, respectively, of a cluster or a peer.
Since a peer may have different associate-heads at different layers, we use the notation aheadi (P ) to refer to
the associate-head of P at layer i. For instance, in Figure
1, ahead0 (22) = 21, ahead1 (22) = 17. Below are the
terms we use for the rest of the paper:
• Foreign head: A non-head non-associate-head clustermate Y of a peer X at layer j > 0 is called a
“foreign head” of layer-(j − 1) clustermates of X.
• Super cluster: A layer-j (j > 0) cluster is the super cluster of any layer-(j − 1) cluster whose head
appears in the layer-j cluster.
Definition 2 [Connectivity in Zigzag hierarchy] (illustrated by the top diagram of Figure 1)
• Intra-cluster connectivity: In a cluster, the
associate-head has a link to every other non-head
peer. E.g., in Figure 1 (top diagram), associate-head
17 of its layer-1 cluster has a link to all of its layer-1
non-head clustermates (peers 2, 5, 9, 13). An exception applies to the highest-layer cluster in which all
peers have a link from its head (because there is no
associate-head for this layer).
• Inter-cluster connectivity: The associate-head of a
cluster has a link from one of its foreign heads. E.g.,
in Figure 1 (top diagram), associate-head 18 at layer
0 has a link from peer 13, which is one of peer 18’s
associate-head others
6
7 8
9
13
9
10 13
11 12
22
26
22
26
29
33
14 17
15 16
18
19 20
21
22
25
23 24
26 29 30 33
27 28 31 32
41
37
34
35 36
38 41
37
39 40
45
49
42 45 46 49
43 44 47 48
50
51 52
22: I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13
26: I1 I2 I3 I4 I5 I6
29: I7 I8 I9 I10 I11 I12 I13
17: I1 I2 I3 I4 I5 I6
2: I2
5: I3
6: I2
10: I3
9: I1 I4
1: I1
14: I4
13: I5 I6
18: I5
33: I7 I8
25: I7
30: I8
37: I9
34: I9
41: I10
38: I10
45: I11 I13
42: I11
50: I13
49: I12
46: I12
21: I6
Figure 1: Top diagram: Connectivity in a zigzag-4 hierarchy of 52 nodes; Bottom diagram: Corresponding
index zone assignments
foreign heads. There is an exception: for a secondhighest-layer cluster, if its associate-head does not
have a foreign head, the associate-head has a link
from the head of the highest-layer cluster. For instance, associate-head 17 at layer 1 has a link from
peer 26 which is the head of the highest-layer cluster.
The above rules guarantee a tree structure including
all the peers; we call this tree the Zigzag tree. A Zigzagk hierarchy of N peers provides the following desirable
properties: (see [3] for complete proofs): (1) The maximum nodal degree in the Zigzag tree is 6k − 3, (2) The
maximum height of the Zigzag tree is 2logk N + 1, (3)
Recovery of a peer failure requires at most 6k − 2 peer
reconnections, and (4) As peers join and leave, a cluster
may be split or merged with another cluster to satisfy the
[k, 3k] cluster size constraint. The worst-case number of
peer reconnections due to a split or merger is O(k).
3.2
Index Zone Assignment Policy
We propose to organize peers into a Zigzag hierarchy.
Each cluster of this hierarchy is assigned a zone of the entire index space. Zone assignment is important to searching and follows the policy described below.
Definition 3 [Zone Assignment Policy]
1. At layer 0: Each layer-0 cluster owns a nonoverlapped index zone, which is a set of hyperrectangles {[α1 , β1 ] × [α2 , β2 ] × ... × [αd , βd ]}, such
that the union of all the zones of layer-0 clusters is
I = [0, 1]d . This zone is known to both the head and
associate-head of the cluster, and also said to be
“covered by”, or “owned by”, the associate-head.
The head will store the indices of those objects that
belong to peers outside this cluster but lie inside its
index zone.
2. At layer j > 0: Each peer P keeps a list of pairs
(childi , zone(childi )) where childi is a child node
of P in the Zigzag tree and zone(childi ) the index
zone covered by childi . The index zone covered by
peer P , denoted by zone(P ), is the union of these
child zones. The index zone owned by a cluster is
that covered by the associate-head of this cluster.
As an example, we consider the hierarchy in Figure 1
and suppose that the index zones owned by the 13 layer-0
clusters are I1 , I2 , .., I13 (respectively, from left to right).
Thus, zone(1) = I1 , zone(5) = I2 , zone(9) = I3 , etc.
Because peer 9 has two children (peer 1 and peer 14),
peer 9 keeps the value {(1, I1 ), (14, I4 )} and zone(9) =
I1 ∪ I4 . The index zone assignments are similar for other
peers and shown in Figure 1 (bottom diagram). Since
peers other than the heads and associate-heads at layer 0
do not own any index zone, they are not present in this
index zone tree.
The advantage of the zone assignment policy is its support for efficient search. A search query just follows the
links in the Zigzag tree to branches that lead to the smallest index zone(s) containing the query. The next subsection details the search algorithm.
3.3
Search Algorithms
We assume that peers are already organized into a Zigzag
hierarchy and index zones are assigned to peers and clusters according to the zone assignment policy. We present
here only the algorithm for range-query search. Algorithms for point queries can be generalized from this algorithm and can be found in [4]. Supposing that a peer
P submits a range query Q, there are two scenarios:
Case 1: P is a leaf in the Zigzag tree (e.g., peers 15,
36, 40 in Figure 1) and P needs to process query
Q. Since P does not have any index zone information, it sends the query to its associate-head ahead0 (P ).
ahead0 (P ) computes Q1 = Q ∩ zone(P ). If Q1 6=
∅, some results of Q, that correspond to Q1 , can be
found locally. Indeed, ahead0 (P ) just needs to broadcast query Q1 to all layer-0 clustermates asking them
to return to peer P the objects inside Q1 . Furthermore, when head(P ) receives Q1 , if it stores any index
(peer location P ′ , term vector Tx ) such that Tx ∈ Q1 ,
head(P ) asks peer P ′ to send object x to peer P . We
must also return the results that correspond to query
Q − Q1 if Q − Q1 6= ∅ because these results are not
in the local cluster. In this case, ahead0 (P ) creates a
new query Q2 = Q − Q1 . How ahead0 (P ) processes
query Q2 is similar to the case below.
Case 2: P is a non-leaf node in the Z-tree (e.g., peers
22, 37, 42 in Figure 1) and P needs to process query Q.
In this case, P must own a zone zone(P ). Query Q is
broken into two subqueries Q1 = Q ∩ zone(P ) and Q2
= Q − Q1 , which will be handled in parallel as follows.
• Query Q1 : If Q1 = ∅, nothing needs to be done.
Else, the results of Q1 can be found in a layer-0
cluster reachable from peer P . By looking at the
list (childi , zone(childi )) for every child, P breaks
Q1 further into subqueries Q11 , Q12 , ... where
Q1i = Q1 ∩ zone(childi ). (It is easy to prove that
Q1i ∩ Q1j = ∅ for every i 6= j.) The results for
Q1i can be found in a layer-0 cluster reachable from
childi . Hence, peer P just needs to forward these
subqueries to the corresponding child peers. The
handling of such a subquery at the corresponding
child resembles that at peer P .
• Query Q2 : If Q2 = ∅, nothing needs to be done.
Else, the results satisfying Q2 cannot be found in
any cluster reachable from P . In this case, P just
needs to forwards Q2 to the parent of P in the
Zigzag tree. The handling of query Q2 at this parent
resembles the way P handles the original query Q.
Eventually, all the subqueries, created when necessary as
above, will reach layer-0 clusters where the corresponding results can be found locally (like in Case 1). The
collection of all these results is the final result for the
original query Q initiated by peer P .
The search path length is at most the maximum distance in hops between two peers in the Zigzag tree, or
4logk N + 2. The search overhead is proportional to the
total number of peers contacted for all the subqueries,
which depends on the range of the original query. In our
performance study, we found that this overhead is indeed
very small.
3.4
Hierarchy Construction
Initially, there is only one peer in the network. It is the
head of its self-formed cluster C, which grows larger as
subsequent peers join. The index zone owned by this
cluster is I = [0, 1]d and the ID of this zone is kept at
the head node. When the cluster size exceeds 3k, we
need to partition C into two smaller clusters, C0 and C1 ,
whose sizes are in the interval [k, 3k]. We propose to
partition I along a selected dimension tl into two halves
I0l = [0, 1]l−1 × [0, 1/2) × [0, 1]d−l and I1l = [0, 1]l−1
× [1/2, 1] × [0, 1]d−l , each to be owned by C0 and
C1 . It is possible that a peer in cluster C0 has an object in I1l (similarly, a peer in cluster C1 may have an
object in I0l ). In this case, we store the index of this object in the other cluster. The number of such objects is
called the index migration overhead. We want to minimize this overhead so that (1) the communication overhead due to index relocation is low, and (2) peers in the
same cluster have highlyPsimilar objects.
P This isPequivalent to minimizing F = P ∈C0 nP
1l +
P ∈C1 n0l where
nP
=
cardinality({object
x
∈
P
|
T
∈
Iil }). The alx
il
gorithm for this purpose is run by head(C) - the head of
cluster C. Upon a request sent by head(C), each peer
P
P in C submits to head(C) a set of tuples (tl , nP
1l , n0l ),
for all l ∈ [1, d]. Upon receipt of those sets from all
the member peers, we can devise a simple greedy but
optimal algorithm for head(C) to find the best C0 , C1 ,
and dimension tl . The complexity of this algorithm is
O(dklog2 k).
Each cluster Ci randomly selects two nodes as its head
head(Ci ) and associate-head ahead(Ci ) (the old head
of cluster C, however, is preferred to remain head of the
newly created cluster it belongs to). The heads will automatically belong to layer 1 and form a new cluster. Since
layer 1 now is the highest layer, only the head needs to
be designated; this head is randomly chosen between the
two member peers. The index zone owned by this cluster
is the union of the zones owned by its child clusters; in
this case, it is I0l + I1l .
Having the Zigzag hierarchy initially constructed, we
need maintain it under network dynamics such as when a
peer publishes or removes objects, and joins or quits the
network. The detailed solutions to these sub-problems
are presented in [4], which shows that removal of a peer
requires O(k) peer reconnections, addition of a peer requires O(logk N ) peer contacts, and addition or removal
of an object also requires O(logk N ) peer contacts.
14
Average
Worst-case
12
10
8
6
4
2
0
5
10
15
20
25
Parameter k
1.8
Average
Worst-case
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
5
10
15
20
25
Parameter k
Figure 2: After the system runs for 5000 seconds, 10003
nodes are active. Each node has up to 10 2-d objects.
2000 queries are generated with ranges following a Zipf
distribution in which about 80% of the queries have a
volume of approximately 20% of the unit hypercube
overhead of 10003/(0.2 × 10003) = 5. EZSearch has
a normalized search overhead always less than 0.6 (on
average) and 1.8 (worst-case), and much smaller when
k is larger. EZSearch therefore is fast and highly efficient. Our future work includes (1) refining the current
algorithms for stronger index locality preservation within
each cluster, and (2) considering various distributions of
objects over peers.
References
4
Simulation Results and Future Work
We conducted simulation for EZSearch. Peers arrived at
rate 2 peers per second and might quit the network randomly at anytime. Thus the contents and indices stored
in the network changed dynamically. The results were
promising. For instance, Figure 2 shows the effect of the
constant k used in the Zigzag-k hierarchy. In all scenarios, the query and any of its subqueries do not travel more
than 12 nodes (among 10,003 nodes) before knowing
the locations of the answers. Normalized search overn
head is computed as N ×V
, where n is the number of
nodes a query and its subqueries visit during the search,
N the number of nodes currently in the system, and V
the volume of the query. For a query of volume 0.2, the
broadcast-based search would incur a normalized search
[1] M. Berry, Z. Drmac, and E. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review,
41(2):335–362, 1999.
[2] D. A. Tran, K. A. Hua, and T. T. Do. Zigzag: An efficient peer-to-peer scheme for media streaming. In
IEEE INFOCOM, San Francisco, CA, March-April
2003.
[3] D. A. Tran, K. A. Hua, and T. T. Do. A peer-to-peer
architecture for media streaming. IEEE Journal on
Selected Areas in Communications, 22(1), January
2004.
[4] D. A. Tran and H. Nguyen. EZSearch: Fast and scalable similarity search in peer-to-peer networks. Unpublished, January 2005.