Computer Networks 54 (2010) 1440–1459
HPC5: An efficient topology generation mechanism for Gnutella networks
Joydeep Chandra *, Santosh Kumar Shaw, Niloy Ganguly
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721 602, India
Article info

Article history: Received 15 April 2009; received in revised form 10 November 2009; accepted 24 November 2009; available online 11 December 2009.
Responsible Editor: I.F. Akyildiz
Keywords: Peer-to-peer networks; Gnutella; Topology generation

Abstract

In this paper, we propose a completely distributed topology generation mechanism named HPC5 for the Gnutella network. A Gnutella topology will be efficient and scalable if it generates fewer redundant queries. This can be achieved if it contains fewer short-length cycles. Based on this principle, our protocol directs each peer to select neighbors in such a way that any cyclic path present in the overlay network will not generate any redundant query. We show that our approach can be deployed into the existing Gnutella network without disturbing any of its parameters. We also show that the probability of inconsistencies arising during topology generation with our mechanism, which may lead to the formation of a small number of short-length cycles, is very low. Nevertheless, we also propose an inconsistency handling protocol that detects such short-length cycles and effectively removes them. We implemented a Gnutella prototype to compare and validate the efficiency of our protocol over existing Gnutella. Simulation results indicate that our mechanism outperforms existing Gnutella in terms of network coverage (the number of unique peers explored during query propagation in limited flooding) and message complexity. Structural analysis indicates that the proposed enhancement conserves the robustness of the existing Gnutella network. Finally, we draw comparisons of the proposed protocol with a state-of-the-art topology optimization protocol named Distributed Cycle Minimization Protocol (DCMP); the simulation results indicate that HPC5 outperforms DCMP in terms of message overhead and network coverage.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

A peer-to-peer (p2p) network is an overlay network, useful for many purposes like file-sharing, distributed computation, etc. Depending upon the topology formation, p2p networks are broadly classified as structured and unstructured. An unstructured p2p network is formed when the overlay links are established arbitrarily. Decentralized (fully distributed control), unstructured p2p networks (Gnutella, FastTrack etc.) are the most popular file-sharing overlay networks. The absence of a structure and central control makes such systems much more robust and highly self-healing compared to the structured systems [11,17]. But the main problem of these kinds of networks is scalability, due to the generation of a large number of redundant messages during query search. Consequently, as these networks are becoming more popular, the quality of service is degrading rapidly [8,12].

To make the network scalable, Gnutella [1–3] is continuously upgrading its features and introducing new concepts. All these improvements can be categorized into two broad areas: improvement in search techniques and modification of the topological structure of the overlay network to enhance search efficiency. In enhanced search techniques, several improvements like Time-To-Live (TTL), Dynamic query, Query-caching and Query Routing Protocol (QRP) have been introduced. One of the most significant topological modifications in unstructured networks was done by introducing the concept of super-peer (ultra-peer) with a two-tier network topology.

* Corresponding author. Tel.: +91 9434305806.
E-mail addresses: [email protected] (J. Chandra), santosh.[email protected] (S.K. Shaw), [email protected] (N. Ganguly).
1389-1286/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2009.11.017
[Figure 1: two example overlay topologies, panels (a) and (b), each with start node S.]
Fig. 1. Effect of topology structure on limited flood based search. The number inside the circle represents the TTL value required to reach that node from
start node S.
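
As a rough illustration of the effect captioned in Fig. 1, the following sketch simulates limited flooding, in which a query is forwarded to every neighbor except the sender until a hop limit is reached, and counts how many messages are sent and how many of them are redundant. The function name `flood` and the two adjacency lists are hypothetical examples introduced here; they are not the topologies drawn in Fig. 1, and the sketch assumes that a peer silently drops a query it has already seen.

```python
from collections import deque


def flood(adj, source, ttl):
    """Simulate TTL-limited flooding on an undirected overlay.

    Returns (peers_reached, messages_sent, redundant_messages), where a
    message is redundant if its receiver has already seen the query.
    Assumption: duplicates are dropped on receipt and never forwarded.
    """
    seen = {source}
    messages = 0
    redundant = 0
    # Each queue entry is one message: (sender, receiver, hops it may still travel).
    queue = deque((source, nbr, ttl) for nbr in sorted(adj[source]))
    while queue:
        sender, receiver, hops_left = queue.popleft()
        messages += 1
        if receiver in seen:
            redundant += 1
            continue
        seen.add(receiver)
        if hops_left > 1:
            # Forward to all neighbors except the one the query came from.
            for nxt in sorted(adj[receiver]):
                if nxt != sender:
                    queue.append((receiver, nxt, hops_left - 1))
    return len(seen) - 1, messages, redundant


# Hypothetical topology with five links, one 3-length cycle (S-A-B).
with_triangle = {
    "S": {"A", "B", "C"},
    "A": {"S", "B"},
    "B": {"S", "A"},
    "C": {"S", "D"},
    "D": {"C"},
}

# Hypothetical topology with five links and no cycle.
tree_like = {
    "S": {"A", "B"},
    "A": {"S", "C", "D"},
    "B": {"S", "E"},
    "C": {"A"},
    "D": {"A"},
    "E": {"B"},
}

if __name__ == "__main__":
    print(flood(with_triangle, "S", 2))
    print(flood(tree_like, "S", 2))
```

With the same number of links, the cycle-free example reaches more peers with fewer messages under a two-hop flood, which is the qualitative behavior the figure is meant to convey.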
The basic search mechanism used by Gnutella is limited flooding [1–3]. In flooding, a peer that searches for a file issues a query and sends it to all of its neighbor peers. A peer that receives the query forwards it to all its neighbors except the neighbor from which it was received. In this way, a query is propagated up to a predefined number of hops (TTL) from the source peer. Recent versions of Gnutella use the dynamic querying protocol [5], where the TTL followed is generally 1 or 2 for popular searches; however, a query with TTL(3) (the numeric value inside the parentheses represents the number of hops to search with) is initiated for rare searches.

The main goal of this paper is to improve the scalability of the Gnutella network by reducing the redundant messages. One of the ways to achieve this is to modify the overlay network, so that small size loops get eliminated from the overlay topology. The rationale behind the proposition is explained through Fig. 1. In this figure, both networks have the same number of connections. With a TTL(2) flooding, the network in Fig. 1a discovers four peers at the expense of seven messages, whereas the network in Fig. 1b discovers six peers without any redundant messages. This happens due to the absence of any 3-length cycle in the network of Fig. 1b. On generalizing, we can say that for a TTL(r) flooding, networks devoid of cycles of length less than (2r + 1) do not generate any redundant messages. We refer to such networks, which do not have any cycles up to length (r − 1), as cycle-r networks.¹ In this paper, we propose a handshake protocol that generates a cycle-5 network topology. This topology will thus rarely produce any redundant message while performing normal TTL(2) search. The strength of our proposed mechanism is its simplicity and the ease of deployment over existing Gnutella networks, along with its power to generate topologies having high efficiency in terms of message complexity and network coverage [18].

¹ A formal proof of the statement for r = 2 is stated in Appendix B.

1.1. Related work

P2P traffic has grown immensely in recent years and constitutes a large portion of the total Internet traffic. Hence developing suitable mechanisms to control excessive traffic, thus reducing the burden on ISPs while maintaining an adequate search performance, has become an important research issue. We discuss certain proposed mechanisms that broadly attempt to modify the topology in unstructured p2p networks to solve the excessive traffic problem and improve search performance.

Liu et al. [9,10] observed that the structural mismatch between the overlay and the underlying network topology is a major cause for traffic redundancy in p2p networks. Peers forming blind overlay connections without knowledge of the underlying physical network can generate huge redundant traffic between autonomous systems. They proposed a location aware topology matching algorithm (LTM) to handle this problem. The basic objective of LTM is to detect the peers within the same autonomous systems on the basis of delay incurred by the messages, and then reorganize the overlay links so that peers within the same autonomous systems are grouped together. LTM uses three fundamental operations to achieve this objective. (a) TTL2-Detector Flooding: Each peer floods a special message, named TTL2-detector message, periodically. The source peer of this message stamps its IP address and the time-stamp; a neighbor on receiving this message appends its IP address and the TTL1 time-stamp and forwards it to its neighbors. Peers receiving the detector message can determine the connection speeds from the message time-stamps. If a peer P receives multiple detector messages from the same source, it initiates the second operation, named slow connection cutting. (b) Slow Connection Cutting: Peer P, on receiving duplicate detector messages from a source node, determines the slowest connection link from the message time-stamps. If the slowest link is a link of node P, the link is disconnected. (c) Source Peer Probing: A peer P, on receiving a message with TTL = 0 from source S through peer N, probes the connection cost of the link SP and also obtains the connection cost of links SN and NP from the message. If link SP does not have the highest cost, the link SP is created and NP is disconnected. Although this algorithm improves search efficiency, the traffic generated due to the detector messages is enormously high; with n peers, the traffic incurred is $O(n^2)$. A similar class of overlay topology based on distance between a node and its neighbors in the physical network structure is presented in [13].

Papadakis et al. [15] presented an algorithm to monitor the ratio of duplicate messages through each network connection. Each peer in the network monitors the number of duplicates received through each of its links, over the total number of received messages. On receiving a duplicate message through a link, a peer P sends a feedback to the upstream node, say Q, of the link. The upstream peer Q maintains a record of the number of duplicate messages sent by itself to the downstream node. Consequently, when the ratio exceeds a certain threshold value, the node does not forward any query through that connection. However, this mechanism suffers from certain inherent drawbacks. Although a peer can receive several duplicate messages through a link from a particular source node, that link
might be the only path to several other source nodes. Disconnecting that link reduces the network coverage of the peers. Moreover, this method is not suitable in case of high peer churn, where peers arrive and depart frequently.

Zhu et al. [25] very recently presented a Distributed Cycle Minimization Protocol (DCMP) to improve the scalability of Gnutella-like networks by reducing redundant messages. They have pointed out the same concept of elimination of short-length cycles. However, this is demand driven and involves a lot of control overhead. The algorithm does not preserve the Gnutella parameters (like degree distribution, average peer distance, diameter, etc.), hence robustness of the evolved network is not maintained. In our work, we take into consideration all the aspects like control overhead, network coverage, and robustness, and propose a holistic yet simple approach to topology formation. The algorithm is initiated as soon as a peer enters the network, rather than being demand driven. Thus the algorithm works during the bootstrap phase when the network is forming, so that less overhead is involved afterwards. It further eliminates the inconsistencies that might occur during the bootstrap phase. We later compare the performance of our algorithm with DCMP, in terms of message overhead and network coverage, using simulations. It is observed that for large network size, our algorithm outperforms DCMP in both these aspects.

Yet another class of algorithms exists that works on the margins of community formation depending upon similar file interests. In these algorithms, the network topology is restricted to certain clusters, where intra-cluster traffic can be high but inter-cluster traffic very low.

Chen et al. [6] proposed a distributed technique to identify peers sharing similar file interests and form overlay connections among them. In this technique, each file $f$ is identified by a set of attributes, say $a_1, a_2, \ldots, a_n$, based on its specific application domain. For a pair of files, say $f_i = (a_1^{(i)}, a_2^{(i)}, \ldots, a_n^{(i)})$ and $f_j = (a_1^{(j)}, a_2^{(j)}, \ldots, a_n^{(j)})$, a feature function $F(a_1^{(i)}, a_1^{(j)})$ measures the correlation between the attributes $a_1^{(i)}$ and $a_1^{(j)}$, i.e. whether the two files $f_i$ and $f_j$ are related through these attributes. This correlation is used to measure a conditional probability, $\Pr(f_j \mid f_i)$, that a peer will request a file $f_j$ given that it had earlier requested a file $f_i$. When a new query for a file $f$ arrives at a peer $p$, it returns $f$ if the file is present. Otherwise, it forwards the query to a new peer $q$ that had earlier requested a file $f_e$ that has the highest conditional probability, $\Pr(f \mid f_e)$, among all other file requests. The idea is that, since $q$ has requested file $f_e$, the probability that it has queried for the file $f$ and obtained it is very high. However, there are two major drawbacks of this approach: finding an appropriate feature function is difficult, and a large amount of file-sharing information needs to be disseminated, which induces high control overhead.

1.2. Organization of the paper

A model of our network environment and basic handshake protocol is presented in Section 2. In Section 3 we describe our handshake protocol. The problems and compatibility of implementing our protocol in Gnutella networks are described in Section 4. In Section 5, we discuss the effects of inconsistency on HPC5 and state suitable algorithms to reduce its effects. We evaluate the performance and analyze the robustness of the evolved networks through simulation in Section 7. Moreover, we also draw comparisons of our algorithm with DCMP. In Section 8 we conclude and draw directions for future work.

A list of main notations that will be used throughout the paper is summarized in Table 1 for ready reference.

Table 1
Notations.

Notation                     Meaning
TTL(r)                       Query search with TTL = r
Cycle-r network              A network which does not have any cycle up to length (r − 1)
Cycle-3 network              Gnutella network
$N$                          Total number of peers in the network
$U$                          Total number of ultra-peers in the network
$L$                          Total number of leaf-peers in the network
$d_{uu}$                     Avg. no. of ultra-neighbors of an ultra-peer
$d_{ul}$                     Avg. no. of leaf-neighbors of an ultra-peer
$d_{lu}$                     Avg. no. of ultra-neighbors of a leaf-peer
$H_k$                        Hit ratio to select kth ultra-neighbor
$\langle H \rangle$          Average hit ratio of a peer
$\langle H_{ev} \rangle$     Average evolved hit ratio of a peer
rth neighbor                 A peer at a distance of r hops. All immediate neighbors are 1st neighbors, all neighbors of 1st neighbors are 2nd neighbors and so on

2. Basic system model of Gnutella 0.6

In order to carry out experiments, a basic version of Gnutella 0.6 [1–3] has been implemented. The basic Gnutella consists of a large collection of nodes that are assigned unique identifiers and which communicate through message exchanges.

2.1. Topology

Gnutella 0.6 is a two-tier overlay network, consisting of two types of nodes: ultra-peer and leaf-peer (the term peer represents both ultra and leaf-peer). An ultra-peer is connected with a limited number of other ultra-peers and leaf-peers. A leaf-peer is connected with some ultra-peers. However, there is no direct connection between any two leaf-peers in the overlay network. Yet another type of peer is called legacy-peer; legacy-peers are present in the ultra-peer level and do not accept any leaves. Since legacy-peers are negligible in the network (they constitute approximately 0.3% of the total peers [24]), we do not consider legacy-peers in our further analysis.

2.2. Basic search technique

The network follows a limited flood based query search. A query of an ultra-peer is forwarded to its leaf-peers with TTL(0) and to all its ultra-neighbors with one less TTL only when TTL > 0. A leaf-peer does not forward a query received from an ultra-peer. On the other hand, ultra-peers perform query searching on behalf of their leaf-peers. The
query of a leaf-peer is initially sent to its connected ultra-peers. All the connected ultra-peers simultaneously forward the query to their neighbor ultra-peers up to a limited number of hops. Since multiple ultra-peers are initiating flooding, a leaf-peer's query will produce more redundant messages if the distance between any two ultra-neighbors is not enough. Gnutella 0.6 incorporates dynamic querying [5] over limited flooding as its query search technique. In dynamic querying, an ultra-peer incrementally forwards a query in three steps (TTL(1), TTL(2), TTL(3), respectively) through each connection after measuring the responsiveness in each step. The ultra-peer can stop forwarding a query at any step if it gets a sufficient number of query hits. Consequently, dynamic querying uses TTL(3) only for rare searches. The modern Gnutella protocol uses the QRP technique over dynamic querying, in which a leaf-peer creates a hash table of all the files it is sharing and sends that table to all its immediate ultra-neighbors. As a result, when a query reaches an ultra-peer it is forwarded to only those connected leaf-peers which would have query hits [1,2].

2.3. Basic handshake protocol

Many software applications (clients) are used to access the Gnutella network (like Limewire, Bearshare, Gtk-gnutella). The handshake protocol of the most popular client software, Limewire, is used in our simulation as a base handshake protocol. Through handshaking, a peer establishes a connection with any other ultra-peer. To start a handshake protocol, a peer first collects the address of an online ultra-peer from a pool of online ultra-peers. A peer can collect the list of online peers from GWebCache systems [4] and/or through pong-caching and/or from its own hard-disk, which has obtained a list of online ultra-peers in the previous run [8]. The handshake protocol is used to make new connections. A handshake consists of three groups of headers [1,2]. The steps of handshaking are elaborated next:

1. The program (peer) that initiates the connection sends the first group of headers, which tells the remote program about its features and the status to imply the type of neighbor (leaf or ultra) it wants to be.
2. The program that receives the connection responds with a second group of headers, which essentially conveys whether it agrees to the initiator's proposal or not.
3. Finally, the initiator sends a third group of headers to confirm and establish the connection.

This basic protocol is modified in this paper to overcome the problem of message overhead.

2.4. Simulated Gnutella

To generate an existing Gnutella network, we have simulated a stripped-down version of the Gnutella 0.6 protocol which follows the parameters of Limewire [1]. Our simulated Gnutella network exhibits all features (like degree distribution, diameter, average path length between two peers, proportion of ultra-peers, etc.) exhibited by the Gnutella network. These features are obtained from the snapshots collected by crawlers [3,19,22]. The properties of the Gnutella [1,22] and simulated Gnutella networks are given in Table 2.

Table 2
Properties of Gnutella and simulated Gnutella network.

Property                          Gnutella                    Simulated Gnutella
No. of peers                      2000k                       100k
Ultra-peer ratio                  15–16% of total peers       15% of total peers
Avg. diameter of ultra-layer      6–7                         4–5
Maximum connections               Ultra–ultra 32              Ultra–ultra 32
(applicable to Limewire)          Ultra-leaf 30               Ultra-leaf 30
                                  Leaf-ultra 3                Leaf-ultra 3
Average connections               $d_{uu}$: 25–26             $d_{uu}$: 22–23
                                  $d_{ul}$: 20–22             $d_{ul}$: 17–18
                                  $d_{lu}$: 3–4               $d_{lu}$: 3

We next highlight the functional features of our simulated Gnutella.

1. The simulator is a multi-threaded parallel execution simulator in which the Gnutella processes running at each node are implemented using separate threads. Each peer arrival spawns a new thread that follows the required bootstrap protocol to form neighbors.
2. The network grows from an initial set of 20 pre-existing ultra-peers that are connected in the form of a one-dimensional lattice of degree 2 and form the initial topology. Each of these peers initially constitutes the GWebCache system.
3. As peers join, the GWebCache system is updated as some proportion of ultra-peer ids are recorded in the GWebCaches. The entries in the GWebCache system provide a gateway to the Gnutella network for new incoming nodes, as new nodes can get some initial peer addresses to connect.
4. A certain number of peers arrive in bursts (we have used a burst size of 25,000); arriving peers are randomly assigned an ultra or leaf-peer status in a ratio as stated in Table 2. These peers randomly connect to any one of the existing ultra-peers, the information of which is retrieved from a random peer in the GWebCache system.
5. Peers deploy the pong-caching [5] mechanism to maintain information about other known peers; peers then use a gossip based mechanism, named ping-pong mechanism, to obtain these cached peer addresses and subsequently connect to them until the maximum neighbor count is reached.
6. Nodes send connection requests to a random set of peers, whose addresses they have obtained using the ping-pong mechanism. Connection requests are sent sequentially, in a non-concurrent manner. We define the time difference between sending of two consecutive connection requests by a peer as a step. Thus, at each step, a peer sends a connection request to exactly one peer.
7. To study the effect of node removal on the Gnutella network, we also implemented a node removal functionality in which a desired number of nodes is randomly removed from the network. However, since this reduces the degree of the existing peers, peers attempt to regain their lost degree by reconnecting with other peers. To simulate this effect, we assume that at each step, a peer that has lost a neighbor sends a connection request to exactly one peer. If the connection request is from a valid peer, i.e. the node can be accepted as a neighbor, then the connection request is accepted. At the end of a step, all peers connect to valid peers from whom a connection request has arrived. We assume that at each step, a peer that has lost an edge connects to at least one new node; hence if it has lost D degrees, then it requires a maximum of D steps to restore its lost degree.

The peer selection strategy during bootstrapping can play an important role in the search performance. Several gossip based strategies exist for random peer sampling [7,21,23]; however, to maintain parity, we implement the pong-caching mechanism that is actually proposed in Gnutella. We ran our simulator on a server having a DP Intel Dual Core Xeon 3.2 GHz 5060 processor with 4GB DDR2 RAM, and could simulate a network size of 1 million nodes.

In the next section we discuss the proposed HPC5 mechanism in detail.

3. HPC5: Handshake protocol for cycle-5 networks

Fig. 2 illustrates the proposed HPC5 graphically. In Fig. 2, peer-1 requests other online ultra-peers to be its neighbor, given that peer-2 is already a neighbor of peer-1. In Fig. 2a and b, the possibility of the formation of a triangle and a quadrilateral arises if a 1st or 2nd neighbor of peer-2 is selected. However, this possibility is discarded in Fig. 2c and a cycle of length 5 is formed.

[Figure 2: panels (a), (b) and (c), each showing peers 1 to 5.]
Fig. 2. Selection of neighbor by peer-1 after making peer-2 as a neighbor.

Each peer maintains a list of its 1st and 2nd neighbors, which contains only ultra-peers (because a peer only sends a request to an ultra-peer to make a neighbor). The 2nd ultra-neighbors of a leaf-peer thus represent the collection of neighboring ultra-peers of its adjacent ultra-peers. To keep an updated knowledge, each ultra-peer exchanges its list of 1st neighbors periodically with its neighbor ultra-peers and sends the list of 1st neighbors to its leaf-peers. To do this with minimal overhead, a piggyback technique can be used in which an ultra-peer can append its neighbor list to the messages passing through it.

The two steps of the modified handshake protocol (HPC5) are described below.

1. The initiator peer first sends a request to a remote ultra-peer which is not in its 1st or 2nd neighbor set. The request header contains the type of the initiator peer. The presence of the remote peer in the 2nd neighbor set implies the possibility of a 3-length cycle. In Fig. 2, peer-1 cannot send a request to peer 2 or 3; on the other hand, peers 4 and 5 are eligible remote ultra-peers.
2. The recipient replies back with its list of 1st neighbors and the neighborhood acceptance/rejection message. If the recipient peer rejects the connection in this step, the initiator closes the connection and keeps the record of neighbors of the remote peer for future handshaking processes. On acceptance of the invitation by the recipient peer, the initiator checks if at least one common peer exists between its 2nd neighbor set (say, A) and the 1st neighbor set of the remote peer (say, B). A common ultra-peer between sets A and B indicates the possibility of a 4-length cycle.
   If no common peer is present between sets A and B, then the initiator sends accept connection to the remote peer.
   Otherwise the initiator sends reject connection to the remote peer.

HPC5 prevents the possibility of forming a cycle of length 3 or 4 and generates a cycle-5 network.

4. Hurdles in implementing the scheme

Before embedding the simple HPC5 in the Gnutella network, several important questions have to be taken into consideration to assess its viability. The most important of them are listed below.

1. Is this scheme compatible with the current population of the Gnutella network?
2. On average, how many trials are required to get an ultra-neighbor?
3. Is there any possibility of an inconsistency and if so, how can such inconsistency be removed?

Each of the questions is discussed one by one.

4.1. Compatibility with the current population of Gnutella

From the ultra-peer point of view, the total number of ultra-leaf connections is $U\,d_{ul}$, and from the leaf-peer point of view it is $L\,d_{lu}$. By equating both, we get
$$U\,d_{ul} = L\,d_{lu}$$
$$U\,d_{ul} = (N - U)\,d_{lu}$$
$$N = U\,\frac{(d_{ul} + d_{lu})}{d_{lu}} \tag{1}$$

[Figure 3: chain of levels P, Q, R, S, T.]
Fig. 3. A part of an ultra-peer layer, where a node represents all nodes that are present in that level. Like, Q represents all 1st neighbors of P.

Fig. 3 represents a part of an ultra-peer layer where P has immediate neighbors at level Q. Suppose P is already connected with $(d_{uu} - 1)$ ultra-peers at Q level and wants to get its $d_{uu}$th ultra-neighbor. According to HPC5, P should not connect to any ultra-peer from R or S level as its next neighbor. However, T can be a neighbor of P. Thus, we can say that if P wants to make a new ultra-neighbor then P has to exclude at most $(d_{uu} - 1) + (d_{uu} - 1)^2 + (d_{uu} - 1)^3$ ultra-peers from Q, R and S level, respectively. So, in total $[(d_{uu} - 1) + (d_{uu} - 1)^2 + (d_{uu} - 1)^3] \approx d_{uu}^3$ peers cannot be considered as next neighbor(s) of P. Therefore the number of ultra-peers in the network needs to be at least

$$U \geq d_{uu}^{3} \tag{2}$$

From Eqs. (1) and (2) we get

$$N \geq \frac{(d_{ul} + d_{lu})\,d_{uu}^{3}}{d_{lu}} \tag{3}$$

Presently the Gnutella network has a population of almost 2000k peers at any time [1]. From Eq. (3) it can be seen that for the present values of $d_{uu}$, $d_{ul}$ and $d_{lu}$ (Table 2), 120–130k peers are sufficient to implement the HPC5 protocol. However, to form cycle-6 networks (HPC6) the number of peers

$$N \geq \frac{(d_{ul} + d_{lu})\,d_{uu}^{4}}{d_{lu}}$$

required is more than 2000k. Hence the current population will not be able to support any such attempts.

4.2. Hit ratio

Hit ratio is defined as the inverse of the number of trials required to get a valid ultra-peer neighbor. As our protocol puts some constraints on neighbor selection, a contacted agreeing remote ultra-peer may not be selected as neighbor. Mathematically, on an average if a peer (say, P) is looking for its kth ultra-neighbor and only the $m_k$th contacted ultra-peer satisfies the constraints and becomes the kth ultra-neighbor of P, then the hit ratio for the kth neighbor will be $H_k = \frac{1}{m_k}$. We first make a static analysis of hit ratio, then fine tune it considering that the network is evolving.

At the time of the kth ultra-neighbor selection in HPC5, a peer (say, P) does not consider its 1st ($(k - 1)$ ultra-peers) and 2nd ($(k - 1)(d_{uu} - 1)$ ultra-peers) ultra-neighbors as a potential neighbor, and this exclusion is locally done by checking the 1st and 2nd neighbor lists of P. The number of ultra-peers excluded ($U'$) is then $[(k - 1) + (k - 1)(d_{uu} - 1)]$.

According to step-3 of HPC5, P cannot make a neighbor from any ultra-peer of level S (3rd ultra-neighbors of P) of Fig. 3, which are $U'' = [(k - 1)(d_{uu} - 1)^2]$ in number.

Therefore, a total of $[U' + U'']$ ultra-peers are excluded. So the hit ratio can be given as

$$H_k = \frac{U - (U' + U'')}{U - U'}$$

Assuming $U' \ll U$, $U' \ll U''$ and $U'' \approx d_{uu}^{2}(k - 1)$, $H_k$ becomes

$$H_k \approx \frac{U - d_{uu}^{2}(k - 1)}{U} \tag{4}$$

The upper bound of k, and consequently the average ultra-degree, differs between leaf-peers and ultra-peers. To generalize further calculations, let m be the average ultra-degree of a peer. So, the average hit ratio is

$$\langle H \rangle = \frac{1}{m}\sum_{k=1}^{m} H_k = \frac{1}{m}\sum_{k=1}^{m}\left[1 - \frac{d_{uu}^{2}(k - 1)}{U}\right] = 1 - \frac{d_{uu}^{2}(m - 1)}{2U} \tag{5}$$

Eq. (5) shows the average hit ratio of a peer joining the network when the population of ultra-peers in the network is U. It also reflects that $(1 - \langle H \rangle)$ is inversely proportional to the number of ultra-peers (U) in the complete network.

Now as each node joins, the network grows. As a result the average hit ratio changes with the network growth. Therefore the evolved hit ratio is the average value of all average hit ratios which are calculated at each growing stage of the network. Let $U_0$ and $U_n$ be the number of ultra-peers in the initial and final networks. So, the evolved hit ratio is

$$\langle H_{ev} \rangle = \frac{1}{U_n - U_0}\sum_{U_i = U_0}^{U_n} \langle H \rangle = 1 - \frac{d_{uu}^{2}(m - 1)}{2\,(U_n - U_0)}\sum_{U_i = U_0}^{U_n}\frac{1}{U_i} \tag{6}$$

Now,

$$\sum_{U_i = U_0}^{U_n}\frac{1}{U_i} \approx \log\frac{U_n}{U_0}$$

Therefore, Eq. (6) becomes

$$\langle H_{ev} \rangle \approx 1 - \frac{d_{uu}^{2}(m - 1)}{2\,(U_n - U_0)}\log(U_n / U_0) \tag{7}$$

From Eqs. (5) and (7) we get

$$\langle H_{ev} \rangle \leq \langle H \rangle$$

As $d_{uu}$ and the maximum value of m are bounded, the value of $\langle H_{ev} \rangle$ increases with U. Again we have tested this phenomenon through our simulation and plotted the evolved hit ratio against the network size varying from 200k to 1000k in Fig. 4 and observed the similarity between them. The similarity is not pronounced in the beginning as the approximations made to develop Eqs. (4) and (7)
are significant in smaller networks. However, as the number of peers increases, both $\langle H \rangle$ and $\langle H_{ev} \rangle$ converge to the same value.

Fig. 4. Hit ratio of a peer against the number of peers in the network. [Plot "Empirical and Theoretical Hit Ratio of a peer": hit ratio versus number of peers; curves: Theo: Evolved Hit Ratio, Emp: Evolved Hit Ratio.]

5. Consistency problems in HPC5

For HPC5 to work correctly, each peer must send correct information about its neighbors to other peers that want to connect. To facilitate this process, peers can periodically exchange the list of their neighbors. Periodically exchanging the list of neighbors helps the peers to get up-to-date information about their neighbors. However, this leads to a situation where parallel update might occur: in between two successive updates, a peer may possibly have erroneous knowledge about its neighbors. As a result, this inconsistency of the network leads to the presence of 3-length or 4-length cycles. Parallel update is possible when many peers enter simultaneously, like during the bootstrap phase [20], or when there is a huge failure/attack in the network whereby many nodes have lost their neighbors and would now like to quickly gain some.

In parallel update, due to inconsistency, short length cycles are formed as multiple peers in the same cycle handshake in parallel with a third common ultra-peer and become each other's neighbor. An instance of a parallel update situation is illustrated through an example (Fig. 5a) where peer-1 and peer-5 execute the following actions according to steps of HPC5 and form cycle-3.

1. Both peer-1 and peer-5 find that peer-P is a valid remote peer to contact and both send request to P.
2. Peer-P gets their request more-or-less at the same time and sends back the neighbor-hood status to them.
3. As peer-1 and peer-5 do not know each other's activity or updated status, they make P their new neighbor; therefore a cycle-3 is formed due to this inconsistency.

Fig. 5. A part of cycle-5 network, representing parallel update inconsistency. [Panels (a) and (b).]

Similarly, smaller cycles may be created when multiple peers contact each other as a directed cycle (as in Fig. 5b) within the period of two successive updates.

In the next section, we provide a detailed analysis of the topological structure arising due to node failure; however, the analysis will be analogous for inconsistencies arising due to bursty peer arrival. We analytically derive the average number of short length cycles formed in the ultra-peer level due to inconsistencies arising from node failure and validate our results using simulation. We then propose an algorithm to handle these inconsistencies, supported by extensive simulation results to validate the efficiency of our algorithm.

5.1. Inconsistency arising in the face of failure/attack

Here, we discuss the topological status of the network when x fraction (where $x \ll 1$) of the nodes leave the network. Our analysis deals with the topological structure at the ultra-peer level; this is because, with QRP, search queries are not forwarded to the leaf-peers, hence the number of short cycles at the ultra-peer level affects the traffic redundancy in the network. Thus, in all our derivations in this section, the term peer refers to an ultra-peer in the network. We derive the distribution of the number of 3 and 4-length cycles formed around an ultra-peer, when a small fraction, x, of the peers leave the network. Then we go on to derive the total number of such cycles formed in the network. We provide simulation results to test and compare the validity of our models. We assume that the nodes have left uniformly from different parts of the network. So each peer loses a fraction of its neighbors and in effect the average degree of a peer in the network becomes less. To maintain the degree distribution of the network, each peer contacts other remote ultra-peers to fulfill neighbor deficiency. During this process, 3-length and 4-length cycles are created temporarily due to inconsistency between two successive updates. As defined earlier, we consider the time difference between sending two consecutive connection requests, by a peer to other peers, as a step; we assume that at each step a peer sends a connection request to exactly one other peer; however, it receives connection requests from many peers. After the removal process, $U_{rem} = (U - xU)$ ultra-peers remain in the network. Since we have assumed that the peers have been removed uniformly throughout the network, the average number of ultra-neighbors of an ultra-peer will get reduced to $\hat{d}_{uu} \approx d_{uu}(1 - x)$. To restore the ultra-neighbor count, each ultra-peer tries to connect to new ultra-peers. We assume that, on average, each peer connects to $x d_{uu}$ new ultra-peers so as to restore the average number of ultra-neighbors to $d_{uu}$. According to HPC5, a peer (say peer-1) cannot make any ultra-peers at level Q, R or S in Fig. 3 its neighbor. With an average ultra-neighbor count of $\hat{d}_{uu}$, the number of ultra-peers from which a peer has to choose a neighbor is

\[
N_{rem} = U_{rem} - \left[\hat{d}_{uu} + \hat{d}_{uu}(\hat{d}_{uu} - 1) + \hat{d}_{uu}(\hat{d}_{uu} - 1)^2\right] = U_{rem} - \hat{d}_{uu}^3 + \hat{d}_{uu}^2 - \hat{d}_{uu}.
\]
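As a rough illustration of the quantities just defined, the following Python sketch evaluates $U_{rem}$, $\hat{d}_{uu}$ and $N_{rem}$ for a given U, $d_{uu}$ and x. The function name `remaining_candidates` and the example values (in particular $d_{uu} = 10$) are assumptions chosen for illustration; they are not parameters reported in this paper.

```python
def remaining_candidates(U, d_uu, x):
    """Return (U_rem, d_hat, N_rem) when a fraction x of ultra-peers leaves.

    Hypothetical helper: it only evaluates the expressions given in the text.
    """
    U_rem = U - x * U                  # ultra-peers remaining after removal
    d_hat = d_uu * (1 - x)             # reduced average ultra-neighbor count
    # Peers at levels Q, R and S of Fig. 3 (1st, 2nd, 3rd neighbors) are excluded.
    excluded = d_hat + d_hat * (d_hat - 1) + d_hat * (d_hat - 1) ** 2
    N_rem = U_rem - excluded           # same as U_rem - d_hat**3 + d_hat**2 - d_hat
    return U_rem, d_hat, N_rem


# Illustrative values only: 30k ultra-peers, d_uu = 10 (assumed), 10% removed.
U_rem, d_hat, N_rem = remaining_candidates(U=30_000, d_uu=10, x=0.1)
print(U_rem, d_hat, N_rem)
```

Note that the expression only makes sense while $N_{rem}$ stays positive, i.e. while $\hat{d}_{uu}^3$ is well below $U_{rem}$.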
We define this set from which a peer can choose a neighbor as the potential neighbor set of the peer. Hence $N_{rem}$ represents the size of the potential neighbor set of a peer; the potential neighbor set of every peer can be assumed to be approximately the same in the limit of large $N_{rem}$. We next find the average number of new connections made by a node at each step, i.e. the average degree that is restored. At each step, a connection is made to peer-1 if peer-1 sends a connection request and finds a valid peer, or any valid peer sends a connection request to peer-1. The probability that peer-1 finds a valid peer at a step is $N_{rem}/U_{rem}$, and the probability that any valid peer connects to peer-1 is $\frac{1}{N_{rem}} \cdot N_{rem} \cdot \frac{N_{rem}}{U_{rem}} = \frac{N_{rem}}{U_{rem}}$. Thus the average number of connections formed by peer-1 at each step is $C = 2N_{rem}/U_{rem}$, and the average number of steps required to restore the average degree is approximately

\[
S = \frac{x d_{uu}}{C} = \frac{U_{rem}\, x d_{uu}}{2 N_{rem}} \tag{8}
\]

We now derive the distribution of the number of 3 and 4-length cycles formed around a peer at each step and then find the average number of cycles formed in the network. It is similar to finding certain prohibitive network motifs [14] that might arise due to inconsistent updates.

5.2. 3-Length cycle

A 3-length cycle is created if three peers get involved in HPC5 as in Fig. 5. The initiation of the handshake protocol in different combinations among three peers may create a triangle. We attempt to model the average number of 3-length cycles formed around a peer, and the average number of such cycles in the whole network. Considering 3 peers named peer-1, peer-5 and P, as shown in Fig. 5a, at each step a 3-length cycle can be formed around peer-1 by various combinations of connection requests made by each of the three peers. If we assume that no two peers simultaneously send connection requests to each other at any step, the various ways of forming a cycle among the three peers can be represented by connecting the peers using directed edges as shown in Fig. 5.

A directed edge from peer-1 to peer-5 implies that peer-1 has sent a connection request to peer-5. We consider a combination of directed edges as invalid (Fig. 6b) if one peer sends connection requests to two different nodes at one single step, i.e. no two directed edges go out from the same peer. Two combinations are considered as isomorphic with respect to peer-1 (Fig. 6a) if removing the labels of the other peers (i.e. peer-5 and P in this case) makes both the graphs similar to each other. Thus a 3-length cycle can be formed due to three major cases.

1. Each of the three peers, peer-1, peer-5 and P, sends a connection request to another. Thus we have $2^3 = 8$ combinations of directed edges, out of which six possible combinations are invalid and the remaining two are isomorphic cases; hence there is one possible way in which a 3-length cycle can be formed due to this case. This case is shown in Fig. 6a. Some of the invalid cases are shown in Fig. 6b.
2. There is an existing edge between peer-5 and P, and these two nodes simultaneously connect to peer-1. The number of such combinations of directed edges is 4, out of which one is invalid and two are isomorphic. Thus a cycle can be formed due to this case in two possible ways (Ref. Case-2 in Fig. 7a).
3. An edge exists between peer-1 and either P or peer-5. Thus 3-length cycles are possible due to four possible combinations of directed edges out of which one is invalid, leading to three possible ways (Ref. Case-3 in Fig. 7a).

Fig. 6. Case 1. Graphs in (a) are isomorphic with respect to peer-1: removing the labels peer-5 and P from both these graphs makes the two graphs absolutely similar. Isomorphic combinations can be considered as a single combination with respect to peer-1. This combination also represents case 1 of 3-length cycle formation. Graphs in (b) are invalid, as in the first case peer-5 sends connection requests to two peers simultaneously (represented by the directed edges from peer-5), whereas in the second figure, peer-1 sends two connection requests at the same time. [Each panel shows peer-1, peer-5 and P.]

We now find the probability that a 3-length cycle forms around peer-1 for each of the above three cases.

Case 1. In case 1, the peers connect to each other in a cyclic order. The probability that peer-1 connects to a random node (say peer-5), which in turn connects to another random peer (say P) that connects back to peer-1 is approximately

\[
A_1^{(3)} = (N_{rem})(N_{rem} - 1)\left(\frac{1}{N_{rem}}\right)^3 \approx \frac{1}{N_{rem}} \tag{9}
\]

Case 2. In case 2, the probability that two given peers, which are themselves connected, both can connect to peer-1 in any possible way is $\left(\frac{1}{N_{rem}}\right)^2$. Since each peer has an average degree of $\hat{d}_{uu}$, the average number of edges shared between $N_{rem}$ peers is $\frac{N_{rem}\hat{d}_{uu}}{2}$, which is the possible number of neighboring peers that can simultaneously attempt to connect to peer-1 forming 3-length cycles. Assuming that the probability that more than one pair of adjacent peers connects to peer-1 is negligible, the probability that a 3-length cycle is formed due to case 2 is
\[
A_2^{(3)} = \frac{N_{rem}\hat{d}_{uu}}{2(N_{rem})^2} = \frac{\hat{d}_{uu}}{2N_{rem}} \tag{10}
\]

There are two possible valid combinations for formation of a cycle due to case 2, and thus each will have an occurrence probability equal to $\frac{\hat{d}_{uu}}{2N_{rem}}$.

Case 3. For each possible valid combination in case 3, the probability that a random node connects to peer-1 is $\frac{1}{N_{rem}}$ and also connects to any neighbor of peer-1 is $\frac{\hat{d}_{uu}}{N_{rem}}$. Thus the probability of forming a 3-length cycle around peer-1 due to any valid combination of case 3 is

\[
A_3^{(3)} = \frac{\hat{d}_{uu}}{N_{rem}} \tag{11}
\]

Considering each of the probabilities $A_1^{(3)}$, $A_2^{(3)}$ and $A_3^{(3)}$ to be small in the limit of large $N_{rem}$, the average number of 3-length cycles formed around peer-1 after it has restored its average degree is

\[
A^{(3)} = S\left(A_1^{(3)} + 2A_2^{(3)} + 3A_3^{(3)}\right) \tag{12}
\]

and the total number of 3-length cycles in the whole network is given by

\[
L_3 = \frac{1}{3} A^{(3)} U_{rem} \tag{13}
\]

Fig. 7. Possible cases of 3 and 4 length cycles. [Panels: Case 2 and Case 3 for 3-length cycles among peer-1, peer-5 and P; Case-1 to Case-7 for 4-length cycles among peer-1, peer-2, peer-3 and peer-4.]

5.3. 4-Length cycle

Similar to the case of a 3-length cycle, formation of a 4-length cycle around a peer involves connections among 4 peers, namely, peer-1, peer-2, peer-3, and peer-4. The definitions for invalid combinations and isomorphic combinations remain the same as defined previously. We enumerate the possible ways in which a 4-length cycle can be formed around peer-1. For each of these cases, we represent one of the possibilities in Fig. 7b.

1. Each of the four peers have no connection among themselves and randomly select each other so as to form a cycle. Similar to the case of 3-length cycles, we assume that the probability of two peers sending request to each other in the same step is zero; thus in this case, a 4-length cycle can be represented by various combinations of the four directed edges among these peers. Out of the 16 possible combinations, 14 combinations are invalid, and two combinations are isomorphic to each other with respect to peer-1.
2. An edge exists between peer-3 and peer-4. 4-Length cycles will be formed if one of these peers connects to peer-1 and another peer connects to a peer, say peer-2, that in turn connects to peer-1. However, there can be several possible combinations by which these connections can be made. The number of such possible combinations is $2^3 = 8$, out of which four are invalid. Thus 4-length cycles can be formed due to this case in four possible ways.
3. There is an existing edge between peer-1 and any of the other peers, say peer-2. Similar to the previous case, the number of such possible combinations is $2^3 = 8$, out of which four are valid, leading to four possible ways of forming a 4-length cycle.
4. Two edges are existing among three peers and none of the edges belongs to peer-1, i.e. the peers peer-2, peer-3 and peer-4 are already connected and nodes peer-2 and peer-4 attempt connecting to peer-1. The number of such possible combinations is $2^2 = 4$, out of which one is invalid and two are isomorphic. Thus a 4-length cycle can be formed due to this case in two possible ways.
5. Two edges are existing, both of which are from peer-1 that connects to peers, say peer-2 and peer-3. A 4-length cycle will be formed if both peer-2 and peer-3 connect to the fourth peer, i.e. peer-4, in any possible combination of directed edges. Similar to the previous case, the number of such possible combinations is $2^2 = 4$, out of which one is invalid and two are isomorphic, leading to two possible ways of forming a 4-length cycle.
6. Two edges are existing, one of which belongs to peer-1 and another does not, i.e. an edge connects peer-1 and another peer, say peer-2, and another edge connects peer-3 and peer-4. The number of possible edge combinations in which a 4-length cycle is formed is $2^2 = 4$, with no invalid or isomorphic cases.
7. Two edges are existing, one belonging to peer-1, connecting to any of its neighbors, say peer-2, and another connecting peer-2 to any other peer, say peer-4. The possible number of edge combinations in which a 4-length cycle is formed is $2^2 = 4$, out of which one is invalid, thus leading to three possible ways of cycle formation.

Table 3
The table states the derivation results of the probability of occurrence of a 4-length cycle around a peer due to each of the above stated cases. The third column indicates the number of ways of forming a cycle for a particular case.

Case #    Prob. of occurrence                                   # of ways
1         $1/N_{rem}$                                           1
2         $\hat{d}_{uu}/(2N_{rem})$                             4
3         $\hat{d}_{uu}/N_{rem}$                                4
4         $\hat{d}_{uu}^2/(2N_{rem})$                           2
5         $\hat{d}_{uu}(\hat{d}_{uu} - 1)/(2N_{rem})$           2
6         $\hat{d}_{uu}^2/(2N_{rem})$                           4
7         $\hat{d}_{uu}^2/N_{rem}$                              3

We derived the probability of forming a 4-length cycle around peer-1 for each of these cases in the limit of large $N_{rem}$, as stated in Table 3. We find the probabilities of occurrence of 4-length cycles due to cases 1, 2 and 3 to be significantly smaller in the limit of large $N_{rem}$. Thus, there are 11 major possibilities of occurrence of 4-length cycles; considering these occurrence probabilities as equal to $\frac{\hat{d}_{uu}^2}{2N_{rem}}$, the number of 4-length cycles formed around a peer at each step can be modeled as a binomially distributed random variable with the probability mass function given as

\[
\binom{11}{i}\left(\frac{\hat{d}_{uu}^2}{2N_{rem}}\right)^{i}\left(1 - \frac{\hat{d}_{uu}^2}{2N_{rem}}\right)^{11-i}.
\]

The average number of cycles thus formed at each step equals $\frac{11\hat{d}_{uu}^2}{2N_{rem}}$. However, according to Eq. (8), the number of steps required for a peer to restore its average degree is represented as S; hence the average number of cycles around peer-1 after it has restored its average degree is

\[
A^{(4)} = \frac{11 S \hat{d}_{uu}^2}{2N_{rem}} \tag{14}
\]

Thus the average number of cycles in the whole network equals

\[
L_4 = \frac{1}{4}\,\frac{11 S \hat{d}_{uu}^2}{2N_{rem}}\,U_{rem} = \frac{11 S \hat{d}_{uu}^2}{8N_{rem}}\,U_{rem} \tag{15}
\]

and the total number of short length cycles formed in the network is

\[
L = L_3 + L_4 \tag{16}
\]

We simulated the HPC5 protocol to find the total number of cycles in the network, and compared the results with our model as shown in Fig. 8. The number of peers considered for our simulations was 200k, having nearly 30k ultra-peers. We found that, as per our assumptions, for low to moderate values of x, the simulation results match well with the model values and deviate when the values of x are large. This is due to an effect that we term the overlapping neighbor set effect. In our model we assumed that the potential neighbor set of two random nodes is approximately the same, which is true when $N_{rem}$ is large. With increasing values of x, the assumption of $N_{rem}$ being very large does not hold. A lower $N_{rem}$ value implies a large overlap of the neighbor sets of any two random peers; this results in the formation of more short-length cycles than is actually predicted by our model. Similarly, for very low values of x, we find a small variation of the simulation results from our model. This is due to an effect that we term the interconnection effect. In practice, the average degree actually lost by the remaining peers is slightly lower than that assumed in the model. This is because we have assumed that the nodes which leave have no connection within themselves, which is not strictly true. However, as the value of x increases, the impact of the overlapping neighbor set effect becomes more prominent; the balance is noticed for moderate values of x (4–6%), where simulation results match perfectly with the model values.

Fig. 8. The total number of cycles formed in the network in HPC5 due to inconsistent updates for a network with 200k total peers and 30k ultras. [Plot: total number of 3 and 4-length cycles versus fraction of peers removed (x); curves: Model, Simulations.]

We find that the total number of short-length cycles formed when 10% of the peers have been removed is around 1025, which is quite small compared to the size of the network. Although the number of cycles formed due to inconsistency in HPC5 is small, in the next section we provide an outline of detecting and removing short length cycles arising due to inconsistency.

6. Handling consistency problems

Theoretically, as seen from Eq. (16), in HPC5 the probability that a peer will receive duplicate queries due to inconsistent updates is quite small. For a network with 200k peers and 30k ultra-peers, the simulation results show that the total number of short length cycles formed in the network, when 10% of the total peers depart, is only around 1025. This implies that the number of redundant messages generated by any peer in HPC5 due to inconsistency is quite small, and this level of inconsistency can arise only if all the 10% of the nodes attempt to regain their links simultaneously. However, to handle such rare flash crowd type situations, we propose an inconsistency handling procedure. Every ultra-peer maintains a record of the number of duplicate
queries received over the total number of received queries in a particular time interval, a ratio that we term the duplicate query ratio. When the duplicate query ratio at an ultra-peer P (say $d_P$) crosses a threshold value (say $d_t$), the peer considers that inconsistency may have occurred in the network. Hence it becomes a coordinator and initiates an inconsistency handling protocol. The major objectives of the inconsistency handling protocol are to predict whether the extent of inconsistency has crossed a limit and, in that case, to detect the short length cycles that may have arisen due to inconsistency, and subsequently remove them by dropping a suitable link. The inconsistency handling protocol is divided into three parts: inconsistency estimation, cycle detection and cycle removal. The algorithm is inherently distributed in nature and each node runs the algorithm separately. We discuss each of the parts in turn.

6.1. Inconsistency estimation

The initial task of the coordinator (i.e. the peer that realizes that its duplicate query ratio has crossed the threshold, $d_t$) is to predict whether the inconsistency in the network has also crossed a particular threshold. We refer to the percentage inconsistency in the network as the percentage of peers that are part of any short length cycle. As observed from the simulation results, the average duplicate ratio of the network (say $\mu$) is highly correlated to the percentage of inconsistency in the network (refer Appendix A). Thus, locally estimating the value of $\mu$ helps in predicting the percentage inconsistency in the entire network. Since it is difficult for the coordinator to obtain the average duplicate ratio of the whole network, $\mu$, the coordinator predicts $\mu$ from the duplicate query ratio of the neighboring peers. Thus to obtain this estimate, the peer queries the neighbors about their duplicate query ratio, calculated over a specific interval of time, by sending a query message. The neighboring peers respond by sending their individual duplicate query ratios, and subsequently, an estimated average duplicate ratio, $\bar{x}$, is calculated. Using $\bar{x}$, the average duplicate ratio, $\mu$, is tested with a given level of significance, $\alpha = 0.05$, to determine whether $\mu$ is less than a threshold value $\mu_c$. If the average duplicate query ratio, $\mu$, crosses the threshold, $\mu_c$, the cycle detection and cycle removal algorithms are initiated to detect and remove the three or four length cycles, respectively.

Algorithm 1. Algorithm INCONSISTENCY ESTIMATION. This algorithm estimates the average duplicate ratio of the network and checks whether it exceeds a threshold. Depending upon the results, it initiates CYCLE DETECTION algorithm

    d_P = duplicate query ratio at peer P
    d_t = threshold duplicate query ratio
    μ_c = threshold average query ratio
    if d_P > d_t then
        Send query to all neighbors requesting their duplicate query ratio d_i
        Estimate the mean x̄ and variance S² of the duplicate query ratio of the network from the values of d_i
        Test the hypotheses H0: μ = μ_c; H1: μ < μ_c for a given level of significance α
        if H1 == FALSE then
            Initiate CYCLE_DETECTION algorithm
        else
            Do Nothing
        end if
    end if

Algorithm 2. Algorithm CYCLE DETECTION. This algorithm detects the short length cycles through any peer P. Although this algorithm is initiated by the peer P, it is a distributed algorithm and is run by each peer through which the CYCLE-DETECT control message passes

    SeqNo := Message Sequence Number
    ID(i) := ID of peer i
    T := Message Time Stamp
    N(i) := Neighbor set of peer i
    Send a CYCLE DETECT control message with TTL(2) to neighbors N(P).
    if a peer Q receives a CYCLE DETECT message created by some other peer then
        Stamp ID(Q) in the message
        Stamp N(Q) in the message
        Save SeqNo, T, and ID of adjacent peer (:= arrID) from whom the message has arrived
        if the CYCLE DETECT message is NOT a DUPLICATE message then
            if TTL == 1 then set TTL := 0 and forward to neighbors N(Q), else do nothing
        else
            if TTL == 0 then
                Set TTL := 2 and send the duplicate message through arrID from where the message arrived
            else
                Set TTL := 1 and forward the duplicate message through arrID from where the message arrived and also to other neighbors N(Q)
            end if
        end if
    end if
    if Peer P receives duplicate CYCLE DETECT messages sent by itself then
        Initiate CYCLE REMOVE algorithm
    end if

6.2. Cycle detection

The major objective of this algorithm is to detect the edges of the cycles that have arisen due to inconsistency. If $\mu \geq \mu_c$ for significance level $\alpha$, then the coordinator attempts to detect those nodes that are involved in a short
length cycle with it by initiating this algorithm. For this, the coordinator sends a special TTL(2) control message, which we name the CYCLE-DETECT message, to its adjacent peers along with a time-stamp. On receiving this message, each adjacent peer records in its own table the message sequence number, the time-stamp, and the id of the peer from which the message has arrived. This record helps in identifying duplicate messages. Further, each adjacent peer attaches to the message its id and the number of its first neighbors, and forwards the message to its adjacent neighbors (other than the one from whom it has received the message), reducing the TTL by 1. If a peer receives a duplicate CYCLE-DETECT message, it implies the existence of a short length cycle. The information about the nodes in the cycle has to be sent to the coordinator. Hence, the peer receiving the duplicate control message attaches its id and the number of its first neighbors to each of the received messages, and sends back each message through the same path from which the message had arrived. Since each peer has recorded the sending peer, they forward the duplicate control message through the same path in the reverse direction towards the coordinator peer. The messages follow the reverse path and reach the coordinator, who thereby obtains the topology information. It then runs the short length cycle removal algorithm to identify those links that should be dropped to remove all short length cycles.

6.3. Cycle removal

The cycle removal algorithm is simple; the coordinator decides to drop the link that will affect the minimum number of nodes. When a coordinator receives back the duplicate CYCLE-DETECT control messages sent by itself, the coordinating peer extracts the information about the number of first hop neighbors of each node from the control messages. To decide which link to drop, for every link it calculates the number of peers that will use the link by taking the sum of the first hop neighbor count of the two nodes of the link, and then drops the link with the minimum sum value. However, if the link connects to a leaf-peer, then it does not drop that link and considers the next best link that connects two ultra-peers. Links to leaf-peers are not disconnected as leaf-peers are connected to only three or four ultra-peers, and reducing their degree by one will have a big impact on their performance. The coordinator peer sends a message to all the peers involved in the cycle informing them of the link to be dropped. All the peers update this information and the peers corresponding to the link propagate this information to all their neighbors. The pseudocode is presented in Algorithm 3. The performance of the algorithm depends upon the proper choice of the threshold average duplicate query ratio $\mu_c$. If the threshold is kept too high, the number of redundant messages in the network might increase enormously, whereas a lower value might lead to unnecessary deletion of links leading to a degradation in performance. Thus we need to have a tradeoff. In Section 7.6, we experimentally derive an idea of $\mu_c$ for certain request patterns. It can be seen that the asymptotic bound on the number of messages transmitted for handling an inconsistency is $O(d_u^2)$, where $d_u$ is the sum of the maximum number of ultra and leaf-neighbors of an ultra-peer.

Algorithm 3. Algorithm CYCLE REMOVE. This algorithm is initiated by a coordinator peer P to remove a suitable cyclic link of an already detected short length cycle in order to break the cycle

    Extract the cyclic path p from the message through which the peer is traversed
    Find the sum of the neighbor count of the endpoint peers of every link of cycle p
    Select the link l_ij with minimum sum that connects peers i and j
    if l_ij connects to a leaf-peer then
        repeat
            Select the link l_ij in path p with next minimum sum
        until i and j are both ultra-peers
    end if
    Send message to peers i and j to drop the link

7. Evaluation by simulation

To validate our approach, we have performed numerous experiments. We have taken different sizes (up to 100k nodes) of networks and performed experiments on those networks several times to obtain the average behavior. Through these experiments, we have seen that HPC5 performs better than the existing protocols.

7.1. Search performance

The efficiency of search algorithms can be measured using various metrics such as success rate, average number of hops required to get results (response time), message complexity, network coverage (number of nodes discovered), percentage of message duplication, etc. [12]. In our simulations, we use message complexity and network coverage as performance metrics. Message complexity is defined as the average number of messages required to discover a peer in the overlay network. Network coverage implies the number of unique peers explored during query propagation in limited flooding. We plot the network coverage and message complexity (y-axis) with TTL(2) and TTL(3) flooding against the size of the network (x-axis). We also show the performance separately for the query originating from leaf-peers and ultra-peers. To get the overall performance of the network, we choose the number of ultra-peers and leaf-peers for query flooding in the same UL ratio. The performance of the network is greatly influenced by the value of TTL used in search and thus we discuss the performance metrics based on TTL(2) and TTL(3) separately. The search performance (especially message complexity) also depends on the implementation of the QRP technique [1]. Thus we discuss the search performance without and with QRP.
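To make the two metrics concrete, the sketch below floods a query over a small undirected overlay with a TTL limit and reports network coverage and message complexity as defined above. It is a toy model under stated assumptions: the adjacency-dictionary representation, the function name `flood`, and the rule that a duplicate query is counted as a message and then dropped are illustrative choices, not the simulator used in this paper.

```python
from collections import deque


def flood(adj, source, ttl):
    """TTL-limited flooding; returns (coverage, messages).

    coverage = unique peers reached, excluding the source
    messages = every query transmission, duplicates included
    """
    seen = {source}
    messages = 0
    # Each queue entry is one transmission: (sender, receiver, ttl carried).
    queue = deque((source, nbr, ttl) for nbr in adj[source])
    while queue:
        sender, receiver, hops_left = queue.popleft()
        messages += 1
        if receiver in seen:
            continue                      # duplicate query: counted, then dropped
        seen.add(receiver)
        if hops_left - 1 > 0:             # forward only while TTL remains
            for nbr in adj[receiver]:
                if nbr != sender:         # never send back to the sender
                    queue.append((receiver, nbr, hops_left - 1))
    return len(seen) - 1, messages


# Two toy overlays (assumed for illustration): a 3-cycle and a 5-cycle.
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
pentagon = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
            "D": ["C", "E"], "E": ["D", "A"]}

for name, graph in (("cycle-3", triangle), ("cycle-5", pentagon)):
    coverage, messages = flood(graph, "A", ttl=2)
    print(name, "coverage:", coverage, "message complexity:", messages / coverage)
```

On these toy graphs a TTL(2) flood from A sends four messages in both cases, but reaches two peers in the triangle (message complexity 2.0) and four peers in the 5-cycle (message complexity 1.0), which is the kind of redundancy that short cycles introduce.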
7.2. TTL(2) without QRP

It is clear from Fig. 9a and b that with TTL(2), cycle-5 networks are better than cycle-3 networks in both message complexity and network coverage. In cycle-5 networks, the network coverage is approximately doubled and message complexity is almost 20% less than that of cycle-3 networks. With TTL(2), a search query covers a significant portion (in our simulation it is more than 30%) of the cycle-5 network with lesser number of redundant messages. From the results we see that the message complexity is not as close to 1 as expected. This is because the message complexity of the leaf-peer generated query is particularly high (Fig. 9b). In cycle-5 networks, a leaf-peer can be connected with two ultra-peers which are themselves 3rd or 4th neighbors of each-other and becomes part of cycle-5 or cycle-6 (Fig. 9). From Fig. 9 we see, a leaf-peer search is initiated by its ultra-peers; both ultra-peers 1 and 4 (in Fig. 9a) {1 and 5 (in Fig. 9b)} start a TTL(2) flooding. Consequently redundant messages are produced at ultra-peers 2 and 3. However, hardly any redundancy is generated in ultra-peer initiated query (Fig. 10b).

As a result, cycle-5 networks generate a large number of redundant messages which get reflected in Fig. 10b.

Fig. 9. Effect of leaf-peer layer with TTL(2) search. The arrows inside and outside of polygons indicate the directions of search by a leaf-peer and an ultra-peer, respectively. [Panel (a): ultra-peers 1 to 4; panel (b): ultra-peers 1 to 5.]

Fig. 10. Network coverage and message complexity with TTL 2 for cycle-3 and cycle-5 networks. [Panels: "TTL-2: Network Coverage" and "TTL-2: Message Complexity", each versus number of peers; curves: ultra-peer, leaf-peer and overall, for cycle-3 and cycle-5.]

7.3. TTL(3) without QRP

Fig. 11a shows that in our simulation, the entire cycle-5 networks is covered with TTL(3) search which is almost double of the coverage attained in cycle-3 networks. If any pair of ultra-neighbors of a particular leaf-peer is not more than 6 hops apart from each other (in case of TTL(2) it was 4 hops), then query generates redundant messages. In cycle-5 networks the probability of forming cycle-5 and cycle-6 is very high. As a result, the message complexity of cycle-5 networks becomes higher than cycle-3 networks (Fig. 11b). As mentioned earlier, Gnutella (Limewire etc.) uses TTL(3) in dynamic querying only for rare searches [1,2]. Therefore larger network coverage in this case, which increases the query hit probability, is more essential than slight increase of message complexity.

Fig. 11. Network coverage and message complexity with TTL 3 for cycle-3 and cycle-5 networks. [Panels: "TTL-3: Network Coverage" and "TTL-3: Message Complexity", each versus number of peers; curves: ultra-peer, leaf-peer and overall, for cycle-3 and cycle-5.]

7.4. TTL(2) and TTL(3) with QRP

With QRP technique, searching is performed only at the ultra-peer layer, since ultra-peers contain the indices of their children [1,2]. So, the measurement of message complexity at the ultra-peer layer is more appropriate to compare results with Gnutella networks. The ultra-peer layer message complexity is shown in Fig. 12. Simulation reflects that the message complexity in TTL(2) of cycle-3 networks is almost 2–2.5 times than that of cycle-5 networks. Even in TTL(3) search, cycle-3 networks generate 25% more messages than that of cycle-5 networks. So HPC5 protocol will be more effective in the Gnutella network in the presence of QRP protocol.

Fig. 12. Message complexity with TTL(2) and TTL(3) at the ultra-peer layer. [Plot "TTL(2) TTL(3) Mesg. Complexity at Ultra-peer layer": message complexity versus number of peers; curves: TTL(2) and TTL(3), for cycle-3 and cycle-5.]

7.5. Comparison of robustness

Fig. 13. Comparison of robustness between cycle-3 and cycle-5 networks. [Plot "Robustness on Random and Pathological Removal": size of the largest component (%) versus percentage of nodes removed; curves: random and pathological removal, in cycle-3 and cycle-5.]

In this section we compare the robustness of the evolved networks with that of the base network. In order to test the robustness of the network evolved through HPC5, we have considered networks with 50k peers and plotted the percentage of existing peers belonging to the largest component against the percentage of peers removed. We have removed peers in two ways: (i) random removal and (ii) pathologically removing the highest-degree nodes first [22]. The upper (right) two curves in
Fig. 13 represent the effect of random removal, whereas the lower (left) two curves represent pathological removal. Both cycle-3 and cycle-5 networks are extremely resilient to random removal, and the largest connected component still contains 80% of existing peers even after removing almost 75–80% of nodes. The network gets fragmented after 85% peer removal. In pathological removal, however, both cycle-3 and cycle-5 networks get fragmented after removing only 6–7% of the total nodes. Overall, from our empirical study, we can conclude that robustness is similar for both cycle-3 and cycle-5 networks.
7.6. Inconsistency handling
We simulated the average duplicate ratio at the peers in the whole network for certain percentages of inconsistent peers in the network. A peer is termed inconsistent if it is involved in a short length cycle. To simulate the effect of inconsistent peers in the network, we re-organized the cycle-5 topology of the simulated Gnutella network to form short length cycles without changing parameters such as the number of ultra and leaf-neighbors, or the total number of neighbors of any peer. We denote the percentage of inconsistent peers in the network as the percentage of peers whose neighbor lists have been modified explicitly by us so as to form short length cycles. We first state the simulation results showing the impact of inconsistent peers on the average duplicate ratio of the whole network and then discuss the effectiveness of our proposed inconsistency handling algorithm.
7.7. Impact of inconsistency on average duplicate ratio
We simulated the average duplicate ratio of the network with respect to the percentage inconsistency in the network for total node counts of 125k and 150k. Every peer in the network was considered equally active, i.e. each peer had the same average query sending rate. Fig. 14 shows the results with and without inconsistency handling. As can be seen, when no inconsistency handling is applied, the average duplicate ratio of the network increases rapidly with increasing percentage inconsistency.
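The strength of this trend can be quantified with the Pearson correlation coefficient defined in Appendix A. The following is a minimal sketch, not the authors' code: it applies that formula to the 150k-peer column of Table 4, and the function and variable names are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation, written as in Appendix A."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Percentage of inconsistent peers (1% to 15%) and the average duplicate
# ratios copied from the 150k column of Table 4.
percent_inconsistent = list(range(1, 16))
duplicate_ratio_150k = [
    0.002005, 0.002171, 0.002388, 0.002654, 0.003124,
    0.003376, 0.003717, 0.004315, 0.004661, 0.005374,
    0.005725, 0.005991, 0.006020, 0.005996, 0.006126,
]

r_150k = pearson(percent_inconsistent, duplicate_ratio_150k)
print(f"r_xy (150k peers) = {r_150k:.3f}")
```

A coefficient close to 1 is what justifies using the measured duplicate ratio as a proxy for the percentage of inconsistent peers.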
Fig. 14. The graph plots the change in the average duplicate ratio of the peers, $\mu$, in the network with increasing percentage of minimum inconsistent peers. The numbers of peers used in the simulation are 125k and 150k. The graph shows the results when no inconsistency handling is done and also for the case when inconsistency handling is applied beyond a threshold of average duplicate ratio $\mu_c = 0.002$. [Plot: "Average Duplicate Ratio for Inconsistent Gnutella Networks"; average duplicate ratio of network versus inconsistent peers (%); curves with and without inconsistency handling for 125K and 150K peers.]
From the Pearson product-moment correlation coefficient of the simulated results (refer Appendix A), we find that the average duplicate ratio and the percentage inconsistency of the network exhibit a very high linear dependency. Thus the average duplicate ratio of the network can be used to predict the percentage inconsistency in the network, i.e. we can claim that when the average duplicate ratio of the peers in the network, $\mu$, exceeds a threshold $\mu_c$, the percentage inconsistency in the network, $i$, has also exceeded a threshold value $i_c$, and vice versa. Hence, our proposed inconsistency handling algorithm estimates the average duplicate ratio of the network to obtain an idea about the percentage inconsistency in the network. We now state the simulation results showing the efficiency of our proposed inconsistency handling algorithm in predicting and reducing the average duplicate ratio of the whole network.
7.8. Efficiency of the proposed inconsistency handling algorithm
We measure the efficiency of the proposed inconsistency handling algorithm by the percentage correctness of inconsistency estimation by the peers and by the reduction in the average duplicate ratio of the network for different percentages of peer inconsistency. The inconsistency handling algorithm was divided into three parts: inconsistency estimation, cycle detection, and cycle removal. We simulated our inconsistency estimation algorithm and obtained the results for percentage correctness of the estimation, i.e. whether the percentage inconsistency $i$ has crossed a threshold value $i_c = 2\%$ when the average duplicate ratio of the network has crossed a threshold value $\mu_c = 0.002$, for 0.05 level of significance, and vice versa. In our simulations, we observed that the average duplicate ratio is 0.002 when the peer inconsistency is nearly 2%. Thus, the threshold value $\mu_c = 0.002$ predicts whether the peer inconsistency has crossed 2%. The percentage inconsistency of the network has been varied from 1% to 15% for a total network size of 125k peers. The results shown in Fig. 15 reveal that the percentage correctness of the estimations is nearly 100%, except for some inconsistencies very near to 2%. This is because, for low percentage inconsistency, the actual average duplicate ratio of the network is very close to 0.002; hence the number of incorrect estimations is higher compared to the case when the actual average duplicate ratio is much higher than the threshold.
Fig. 15. Percentage correctness of the test, whether the percentage inconsistency $i$ has crossed a threshold value $i_c = 2\%$, when the average duplicate ratio of the network has crossed a threshold value $\mu_c = 0.002$ for 0.05 level of significance and vice versa. The percentage inconsistency was varied from 1% to 15%. The total number of peers used was 125k. [Plot: % correctness of estimated results versus inconsistent peers (%); no. of peers = 125K, critical duplicate ratio = 0.002.]
Fig. 16. The graph shows the change in the average degree of the peers when the inconsistency handling protocol is applied on the peers when the average duplicate ratio $\mu$ of the network crosses the threshold $\mu_c = 0.002$. The number of peers used in the simulation was 125k. [Plot: % change in average degree of peers versus inconsistent peers (%).]
The efficiency of the cycle detection and removal algorithms is shown in Fig. 14. As the figure shows, the cycle detection and removal algorithm successfully reduces the average duplicate ratio of the whole network to the threshold bound of 0.002 for all percentages of peer inconsistency. However, the cycle removal algorithm deletes
certain links in the network to remove inconsistencies. Hence we need to study the effect of link removal on the average degree of the peers. Fig. 16 shows the percentage change in the average degree of the peers following the cycle removal algorithm. The maximum reduction of the average degree is around 4%. Since, from Table 2, the average degree of the ultra-peers is $d_{uu} + d_{ul} \approx 48$, the number of edges that might be removed per peer is expected to be around 1–2, at most.
We next compare the performance of the HPC5 protocol with the Distributed Cycle Minimization Protocol (DCMP) for p2p networks.
7.9. Comparison with DCMP
In this section, we compare the performance of DCMP with HPC5 in terms of message overhead and the network coverage of the peers. The DCMP protocol reduces traffic redundancy by detecting the short length cycles dynamically and deleting suitable edges to remove these cycles. In Algorithm 4 (Appendix C), we highlight the major steps of the DCMP algorithm for ready reference. The performance comparison is based on the simulation results that we obtain from implementing the DCMP protocol in our simulator. In the DCMP paper [25], results have been given for at most 10k nodes, without considering certain specific details of the Gnutella network (like average degrees of the ultra and leaf-peers, QRP, etc.). Hence, to make the comparison with HPC5, we deemed it fit to implement the DCMP protocol on our simulation platform.
The DCMP protocol follows a lazy approach in detecting and removing cycles in Gnutella networks; however, as the gatepeers and transitive peers constantly broadcast tagged messages to their TTL-hop neighbors, a fixed number of control messages will always flow through the network. Another problem with the DCMP protocol is that the deletion of edges to remove cycles might reduce the network coverage of the peers in the network. In contrast, HPC5 generates control messages only at the bootstrapping phase, when a node joins a network. Since HPC5 avoids short length cycle formation by selecting suitable neighbors, rather than removing edges, the network coverage is also higher as compared to DCMP. However, DCMP works for any TTL value and has been shown to perform well for TTL values up to 8, whereas HPC5 supports only TTL(2) flooding. We implemented the DCMP protocol on our simulator; the comparison of the message overhead and the network coverage with HPC5 is detailed next.
Fig. 17. (a) Shows the total number of control messages transferred (in megabytes) in the whole network. The figure in inset shows the average number of control messages sent per peer (in kilobytes). (b) Compares the average network coverage of both leaf and ultra-peers in DCMP and HPC5. The figure in inset shows the coverage of the leaf-peers, and the main figure shows the average network coverage of the ultra-peers in a Gnutella network. The number of peers in the simulations is varied from 30k to 100k. [Plots: (a) "Overhead Comparison of DCMP and HPC5", total control messages transferred (in megabytes) versus total number of peers in network (in thousands); (b) "Comparison of Average Network Coverage: DCMP vs HPC5 (With QRP)", average network coverage per peer versus no. of peers (in thousands), with curves for DCMP ($TTL_d = 2$), HPC5 without link removal and HPC5 with link removal.]
7.10. Comparison of message overhead
We measured the message overhead as the total number of control messages sent over the network for a certain period of time. We compare the message overhead for both HPC5 and DCMP with respect to the size of the network (number of nodes). The simulation parameters that we have used are as follows:
1. We considered that every peer sends an average of 3.6 query requests per hour distributed uniformly (as considered in the DCMP paper [25]), and ran the simulation for two hours of simulation time and various network sizes.
2. For both HPC5 and DCMP, we used a growing network where peers arrive in a burst of 25k after every 10 min. The growth process in each case is similar and is discussed in Section 2.
3. The minimum Gnutella message size is 23 bytes [16]; we consider each Node Information Vector (NIV) field of a peer A of length equal to six times the number of neighbors of A. This is because, for each neighbor of A, 4 bytes is allocated to represent the address of A and 2 bytes for representing the bandwidth of A, making a total of 6 bytes.
4. In our experiments, we ignore the overhead in DCMP due to dissemination of gatepeer information in the form of tagged messages.
5. For the case of HPC5, we assume that a peer sends a control message of 23 bytes for a connection request to a peer. The peer responds with the list of addresses
of its neighbors; each address along with the port number is assumed to be of 6 bytes. An additional overhead of 23 bytes, which is the minimum Gnutella message size, is also assumed for the latter case.
As can be seen from Fig. 17a, for a large network size ($N = 100k$), in HPC5 the total number of control messages sent through the network (measured in bytes) is less by almost 30% compared to DCMP. This is because, in HPC5, the overhead is mainly due to the response messages sent by the peers in reply to a connection request. With an increasing number of nodes, the probability of finding a valid peer at a particular attempt increases; thus peers need to send fewer connection requests, thereby reducing the number of control messages. Hence, with increasing network size, the average number of control messages sent per peer decreases much faster compared to DCMP (refer inset in Fig. 17a).
7.11. Comparison of network coverage
We also measured the average TTL(2) network coverage of both ultra and leaf-peers, for both DCMP and HPC5, when QRP is followed. We assume that DCMP preserves all cycles of length greater than 4 and eliminates all 3 and 4-length cycles, i.e. $TTL_d = 2$ [25]. We compare the results of DCMP with two possible cases of HPC5.
1. In the first case, we assume that HPC5 does not remove any link, i.e. HPC5 does not initiate the inconsistency handling algorithm.
2. In the second case, to compare DCMP with the worst case performance scenario of HPC5, we remove 4% of the total links (as seen in Fig. 16), but from the ultra-peer level of the network. This simulates the possible scenario that a large number of links have been removed by the inconsistency handling algorithm of HPC5, due to inconsistency. Removing 4% of the total links reduced the average degree of the ultra-peers by as much as 4%.
As observed in Fig. 17b, in both cases, DCMP has much lower network coverage as compared to HPC5. This is because Gnutella networks form a large number of 3 and 4-length cycles; hence in DCMP, a high number of duplicate messages are generated at each peer. Thus DCMP eliminates a large number of links (we observed that for $TTL_d = 2$, DCMP eliminates about 16% of the total links), thereby reducing the average network coverage of the peers.
8. Conclusion and future work
In this paper, we have presented a handshake protocol which is compatible with Gnutella-like unstructured two-tier overlay topology. The objective of the protocol is to enhance the network coverage of the peers, besides reducing redundant messages, in the existing Gnutella networks. Our protocol is based on the observation that network coverage of peers can be improved, and redundant messages can be reduced, if we suitably eliminate short length cycles from the network topology. We have shown that the protocol is far more efficient than existing protocols. We also studied the various hurdles in implementing our scheme and proposed methods to handle inconsistency in the network. The basic strength of our protocol lies in its simplicity and the ease with which it can be implemented on existing Gnutella networks. The modularity of the handshaking and the inconsistency handling protocols lets designers decide on the use of either of these protocols without affecting the other one. Moreover, the inconsistency handling protocol can be tuned dynamically. Our protocol can be instrumental in improving the scalability of Gnutella networks; moreover, this concept of topology management can be customized for other unstructured p2p networks to improve blind search behavior, so that it performs comparably with many proposed intelligent search mechanisms that are generally more computation intensive.
Appendix A. Relation between average duplicate ratio and inconsistent peers
The following table indicates the result of the simulations (see Table 4).

Table 4
The table shows the simulation results of the average duplicate ratio at the peers for a given percentage of inconsistent peers. The total number of peers used in the simulation are 125k and 150k.

% Inconsistent peers    Average duplicate ratio (120k)    Average duplicate ratio (150k)
 1                      0.001889                          0.002005
 2                      0.002019                          0.002171
 3                      0.002336                          0.002388
 4                      0.002597                          0.002654
 5                      0.003103                          0.003124
 6                      0.003366                          0.003376
 7                      0.003786                          0.003717
 8                      0.004328                          0.004315
 9                      0.004812                          0.004661
10                      0.005406                          0.005374
11                      0.005820                          0.005725
12                      0.006051                          0.005991
13                      0.006066                          0.006020
14                      0.008082                          0.005996
15                      0.006214                          0.006126

If X and Y denote two parameters of an experiment and we take n measurements of the parameters denoted by $(x_i, y_i)$, where $i = 1, 2, \ldots, n$, then the Pearson Correlation Coefficient between X and Y is given by

$$ r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\; \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}, $$

where all the summations are carried out for $i = 1$ to $n$. For the given data, let X denote the percentage of inconsistent peers and Y denote the average duplicate ratio at the peers; we calculated the value of $r_{xy}$ for 120k peers and
150k which is around 0.984 for both cases. Since the values of $r_{xy}$ are close to 1 in both cases, there is a very high linear dependency between the two parameters. Thus a peer can use the duplicate ratio of itself and of its neighboring peers to estimate the average duplicate ratio of the whole network, based on which it can decide whether to initiate the inconsistency handling protocol.
Appendix B. Importance of cycle-5 networks for Gnutella
Gnutella networks use a TTL(2) query for most searches; however, for rare searches a TTL value of 3 is used (dynamic querying [5]). A TTL(2) query from a peer reaches its first as well as its second neighbors. Thus, the network coverage of a peer may be defined as the total number of its first and second neighbors. Suppose a peer, $P$, in a p2p network has $F(P)$ first neighbors, and let $L(P)$ denote the number of outgoing edges from the $F(P)$ first neighbors of $P$. Moreover, suppose the $L(P)$ edges connect to $S(P)$ unique peers other than $P$ or any of its first neighbors. Then the network coverage of $P$ is given by the sum $F(P) + S(P)$. We prove that, for given $F(P)$ and $L(P)$, the peer $P$, which uses TTL(2) search queries, will have maximum network coverage if $P$ is not involved in any cycle of length less than five. Moreover, under this condition, the peer $P$ does not produce any redundant message at any of its first or second neighbors. We prove each of the statements using a lemma and a theorem as stated below. In each case, we represent the network as a graph $G\langle V, E\rangle$, where the vertex set $V$ represents the peers in the network and the edge set $E$ represents the overlay links between them.
Lemma 1. Given a network, represented as $G\langle V, E\rangle$, with a source vertex $Q$. A TTL(2) flood query from the source vertex $Q$ will reach a vertex $v \in V$ exactly once, if the minimum length cycle through $Q$ and $v$ has length 5, and the minimum distance of $v$ from $Q$, given by $D_Q(v)$, is less than or equal to 2.
Proof 1. Since the minimum distance of node $v$ from $Q$ satisfies $D_Q(v) \le 2$, node $v$ must be reachable from $Q$ using a TTL(2) flood query. We now prove that if there exists a cycle between $Q$ and $v$ of minimum length 5 and $D_Q(v) \le 2$, then there will be no duplicate query at vertex $v$ from source $Q$. We prove this by contradiction. Let us assume that node $v$ receives a duplicate query from source $Q$ through two separate paths, say $p_1$ and $p_2$. Then the path length between $Q$ and $v$ through each of these paths, $p_1$ or $p_2$, is at most 2. Thus the cycle represented by $Q \xrightarrow{p_1} v \xrightarrow{p_2} Q$ is at most of length 4, which contradicts our initial assumption that $Q$ and $v$ are connected through a cycle of minimum length 5. Thus no redundant messages can be produced at $v$. □
Theorem 1. Given a network, represented as $G\langle V, E\rangle$, with a source vertex $Q$ that uses a TTL(2) based flooding mechanism for searching. Let $F(Q)$ be the number of first neighbors of $Q$ and $L(Q)$ denote the total number of outgoing edges from the $F(Q)$ first neighbors. If the $L(Q)$ outgoing edges from the first neighbors lead to $S(Q)$ unique second neighbors of $Q$, then for a given value of $F(Q) + L(Q)$, the network coverage of $Q$, given by $C(Q) = F(Q) + S(Q)$, is maximum if and only if $Q$ is not involved in any cycle of length less than 5.
Proof 2. If-part: If $Q$ is not involved in any cycle of length less than 5, then $C(Q)$ is maximum. The if-part of the proof is trivial and is directly obtained from Lemma 1. According to Lemma 1, if the minimal length cycle from $Q$ through any node $v \in V$ has length 5, and $D_Q(v) \le 2$, a TTL(2) flood message from $Q$ must reach $v$ exactly once. This implies that all the nodes within two hops from $Q$ are reached exactly once from $Q$, i.e. all the $L(Q)$ edges from the first hop neighbors of $Q$ must reach different and unique nodes. Hence, the number of unique second neighbors reached is $S(Q) = L(Q)$. Since $S(Q)$ cannot be greater than $L(Q)$, for a given $F(Q) + L(Q)$ the coverage of node $Q$ will be maximum and equal to $C(Q) = F(Q) + L(Q)$. Thus it is proved that if $Q$ is not involved in any cycle of length less than 5, then $C(Q)$ is maximum.
Only-if part: If $C(Q)$ is maximum, then $Q$ is not involved in any cycle of length less than 5. We prove this statement by contradiction. Suppose $C(Q)$ is maximum for a given $F(Q) + L(Q)$ when $Q$ is involved in some cycle of length less than 5. We note that any pair of peers will be connected by a single edge only (no parallel edges occur), so there is no possibility of any cycle of length 2. Thus $Q$ can be involved in a cycle of either length 3 or 4. Let us initially assume that $Q$ is involved in a cycle of length 3. Then two of its first neighbors (say $N_1$ and $N_2$) must be connected by an edge $E_{12}$ between themselves. We remove the end of edge $E_{12}$ connected to $N_1$, and connect it to a new node, $N_a$, that is not presently a first or second neighbor of $Q$. Thus, for the same value of $F(Q) + L(Q)$, the number of second neighbors of $Q$ increases by 1; hence we have the new coverage $C'(Q) > C(Q)$, which is not possible as we have already assumed that $C(Q)$ is maximum. Thus, our assumption was wrong and $Q$ cannot be involved in a cycle of length 3.
Now suppose that $Q$ is involved in a cycle of length 4. Then two of its first neighbors (say again $N_1$ and $N_2$) must be connected to a common peer, say $N_3$. We remove the link connecting $N_2$ and $N_3$, and add a new link from $N_2$ to a new peer $N_b$ that is not currently a first or second neighbor of $Q$. Thus, as in the previous case, for a given value of $F(Q) + L(Q)$, the number of second neighbors of $Q$ increases by 1; hence the new coverage $C'(Q) > C(Q)$, which is not possible as we have already assumed $C(Q)$ to be maximum. Hence $Q$ cannot be involved in any cycle of length 4.
Thus for $C(Q)$ to be maximum, $Q$ cannot be involved in any cycle of length less than 5. □
The proof generalizes as follows: for a given number of edges in a p2p network, the network coverage of the peers can be enhanced if we increasingly eliminate short length cycles from the network.
Appendix C. Steps of algorithm of the DCMP protocol
We present an algorithmic outline (Algorithm 4) of the DCMP protocol for ready reference.
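As a rough companion to the outline of Algorithm 4, the trigger condition (a peer sees the same GUID arrive from two different neighbors) can be sketched as follows. This is an illustrative sketch only and not the DCMP implementation: the `Peer` and `ICMessage` classes, their field names, and the use of a peer name in place of a full NIV are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ICMessage:
    """Information Collecting (IC) message, reduced to the fields named in Algorithm 4."""
    guid: str
    detection_id: Tuple[str, str]  # direction of the duplicate, e.g. ("C", "A") for C -> A
    niv_owner: str                 # stands in for NIV(A); real NIV contents are omitted
    forward_to: Tuple[str, str]    # the two neighbors the duplicates arrived from


@dataclass
class Peer:
    name: str
    # History table: GUID -> neighbor from which the message first arrived.
    history: Dict[str, str] = field(default_factory=dict)

    def receive(self, guid: str, sender: str) -> Optional[ICMessage]:
        """Record the message; return an IC message if it is a duplicate
        arriving from a different direction than the first copy."""
        first_sender = self.history.get(guid)
        if first_sender is None:
            self.history[guid] = sender
            return None
        if first_sender == sender:
            # Same direction again: not evidence of a cycle in this sketch.
            return None
        return ICMessage(
            guid=guid,
            detection_id=(sender, self.name),
            niv_owner=self.name,
            forward_to=(first_sender, sender),
        )


peer_a = Peer("A")
first = peer_a.receive("guid-1", "B")   # first copy, only recorded
second = peer_a.receive("guid-1", "C")  # duplicate from another direction
print(first, second)
```

The remaining steps of Algorithm 4 (collecting NIVs along both directions, choosing the gatepeer and the cut position) are not modelled here.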
Algorithm 4. Algorithm DCMP. Each message in DCMP is assigned a globally unique ID (GUID); every peer A, on receiving a message from B, records the GUID and its direction of travel (B → A) in a history table. The algorithm is initiated by a peer A that receives a duplicate message (two messages with the same GUID) from two different directions, say B and C.
GUID := Globally Unique ID of the duplicate message
DetectionID := The direction of the duplicate message (say B → A)
NIV(A) := Node Information Vector of peer A containing the IP address, bandwidth, CPU speed, degree and the neighbors of A
On receiving duplicate messages from a source Q through 2 different adjacent nodes, say B and C, peer A introduces a new message, called an Information Collecting (IC) message, with the same GUID, containing DetectionID and NIV(A), and forwards it to both B and C.
For each node i on receiving the IC message
repeat
    Record the GUID and DetectionID
    Add its own NIV, i.e. NIV(i)
    Propagate the IC message in the reverse direction from which the original message arrived.
until a node i receives both IC messages
if peer X receives both IC messages then
    peer X checks the NIVs of each peer in the message and finds the most powerful peer (Gatepeer) depending upon the degree, bandwidth and CPU speed,
    determines the cut position by finding the link whose end nodes are maximally equidistant from the Gatepeer,
    sends a Cut Message (CM) to the peers in the cycle containing the information about the link to cut and the IP address of the gatepeer.
end if
if the CM arrives at the Gatepeer then
    the Gatepeer generates a tagged message containing the NIV of the gatepeer and the number of hops from the message origin to the gatepeer,
    piggybacks the tagged message with other messages that are forwarded by the gatepeer,
    periodically sends the tagged message; initially it is sent frequently, and the rate gradually slows down with time.
end if
if a peer Y receives tagged messages from more than one direction then
    peer Y becomes a transitive peer,
    it also generates tagged messages and piggybacks them.
end if
References
[1] Gnutella and Limewire: <URL www.limewire.org>.
[2] Gnutella protocol specification 0.6: <http://rfc-gnutella.sourceforge.net>.
[3] Gnutella: <URL www.gnutellaforums.com>.
[4] Gwebcache system: <URL www.gnucleus.com>.
[5] How gnutella works, <URL wiki.limewire.org>.
[6] G. Chen, C.P. Low, Z. Yang, Enhancing search performance in unstructured P2P networks based on users' common interest, IEEE Transactions on Parallel and Distributed Systems 19 (6) (2008) 821–836.
[7] M. Jelasity, S. Voulgaris, R. Guerraoui, A.-M. Kermarrec, M. van Steen, Gossip-based peer sampling, ACM Transactions on Computer Systems 25 (3) (2007) 8.
[8] P. Karbhari, M.H. Ammar, A. Dhamdhere, H. Raj, G.F. Riley, E.W. Zegura, Bootstrapping in Gnutella: a measurement study, in: PAM, Lecture Notes in Computer Science, vol. 3015, Springer, 2004, pp. 22–32.
[9] Y. Liu, X. Liu, L. Xiao, L.M. Ni, X. Zhang, Location-aware topology matching in P2P systems, in: INFOCOM: The Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2004.
[10] Y. Liu, L. Xiao, X. Liu, L.M. Ni, X. Zhang, Location awareness in unstructured peer-to-peer systems, IEEE-TPDS: IEEE Transactions on Parallel and Distributed Systems 16 (2005) 163–174.
[11] K. Lua, J. Crowcroft, M. Pias, R. Sharma, S. Lim, A survey and comparison of peer-to-peer overlay network schemes, Communications Surveys & Tutorials, IEEE (2005) 72–93.
[12] Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured peer-to-peer networks, in: Proceedings of the 2002 International Conference on Supercomputing (16th ICS'02), ACM, 2002, pp. 84–95.
[13] S. Merugu, S. Srinivasan, E.W. Zegura, Adding structure to unstructured peer-to-peer networks: the role of overlay topology, in: NGC 2003, and ICQT 2003, Proceedings, Lecture Notes in Computer Science, vol. 2816, Springer, 2003, pp. 83–94.
[14] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon, Network motifs: simple building blocks of complex networks, Science 298 (5594) (2002) 824–827.
[15] C. Papadakis, P. Fragopoulou, E. Athanasopoulos, M.D. Dikaiakos, A. Labrinidis, E. Markatos, A feedback-based approach to reduce duplicate messages in unstructured peer-to-peer networks, in: Integrated Research in GRID Computing, 2007.
[16] M. Portmann, A. Seneviratne, Cost-effective broadcast for fully decentralized peer-to-peer networks, Computer Communications 26 (11) (2003) 1159–1167.
[17] S. Saroiu, P.K. Gummadi, S.D. Gribble, A measurement study of peer-to-peer file sharing systems, Tech. rep., July 23 2002.
[18] S. Shaw, J. Chandra, N. Ganguly, HPC5: an efficient topology generation mechanism for gnutella networks, in: 10th International Conference on Distributed Computing and Networking – ICDCN, Hyderabad, 2009.
[19] D. Stutzbach, R. Rejaie, Capturing accurate snapshots of the gnutella network, IEEE INFOCOM, 2005, pp. 2825–2830.
[20] D. Stutzbach, R. Rejaie, Understanding churn in peer-to-peer networks, in: IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet Measurement, ACM, New York, NY, USA, 2006, pp. 189–202.
[21] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, W. Willinger, On unbiased sampling for unstructured peer-to-peer networks, IEEE/ACM Transactions on Networking 17 (2) (2009) 377–390.
[22] D. Stutzbach, R. Rejaie, S. Sen, Characterizing unstructured overlay topologies in modern P2P file-sharing systems, in: Internet Measurement Conference, USENIX Association, 2005, pp. 49–62.
[23] D. Gavidia, S. Voulgaris, M. van Steen, CYCLON: inexpensive membership management for unstructured P2P overlays, Journal of Networks and System Management 13 (2) (2005) 197–217.
[24] S. Zhao, D. Stutzbach, R. Rejaie, Characterizing files in the modern gnutella network: a measurement study, in: Proceedings of SPIE/ACM Multimedia Computing and Networking, vol. 6071, 2006.
[25] Z. Zhenzhou, K. Panos, B. Spiridon, DCMP: a distributed cycle minimization protocol for peer-to-peer networks, in: Parallel and Distributed Systems, IEEE Transactions, vol. 19, IEEE, 2008, pp. 363–377.
Joydeep Chandra completed B.Tech. (Computer Science and Engineering) from Haldia Institute of Technology, India in 2000 and M.Tech. (Computer Science and Engineering) from Indian Institute of Technology, Kharagpur in 2002. Presently, he is pursuing his Ph.D. from Indian Institute of Technology, Kharagpur. His research interests include p2p networks, distributed algorithms and complex networks.
Santosh Kumar Shaw completed BE (Computer Science and Engineering) from Bengal Engineering College (D.U), India in 2001 and M.Tech (Computer Science and Engineering) from Indian Institute of Technology, Kharagpur, India in 2008. He served as a faculty member at University Institute of Technology, Burdwan University, India. He also served in Magma Design Automation as Associate Member of Technical Staff. He is presently working with Berkeley Design Automation as a Member of Technical Staff. His research interests include P2P networks and distributed algorithms.
Niloy Ganguly is an Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur. He received his B.Tech from IIT Kharagpur in 1992 and his PhD from BESU, Kolkata, India in 2004. He spent two years as a Post-Doctoral Fellow at Technical University, Dresden, Germany before joining IIT, Kharagpur in 2005. His research interests are in peer-to-peer networks in particular and distributed dynamic networks in general. He has applied various bio-inspired techniques to solve problems in such networks. He also works on applying complex networks methodology in various information retrieval problems. He is presently leading a research group comprising several research scholars and masters students as well as various collaborators from industry and academia. Visit the Complex Networks Research Group (CNERG) at www.cse-web.iitkgp.ernet.in/~cnerg/.