HPC5: An Efficient Topology Generation Mechanism for Gnutella Networks

https://doi.org/10.1007/978-3-540-92295-7_16

Abstract

In this paper, we propose a completely distributed topology generation mechanism named HPC5 for Gnutella networks. A Gnutella topology is efficient and scalable if it generates few redundant queries, which in turn requires that it contain few short cycles. However, eliminating cycles entirely reduces the coverage of the peers in the network. In the tradeoff between cycle length and network coverage, we have found that a minimum cycle length of 5 provides the minimum query redundancy with maximum network coverage. Our protocol therefore directs each peer to select neighbors in such a way that any cyclic path present in the overlay network has a minimum length of 5. We show that our approach can be deployed into the existing Gnutella network without disturbing any of its parameters. Simulation results indicate that HPC5 is very effective for Gnutella's dynamic query search over limited flooding.
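
As a rough illustration of the tradeoff stated above, the following Python sketch counts the messages of a TTL-limited flood and applies a neighbor test modeled on the 1st/2nd-neighbor check that the paper describes for HPC5 in Section 3. The function names (`flood`, `hpc5_allows`, `ring`), the adjacency-dict representation, and the toy ring and path topologies are assumptions made for this example; they are not taken from the authors' simulator.

```python
def flood(adj, source, ttl):
    """Limited flooding from `source`; returns (peers_reached, messages_sent)."""
    seen = {source}
    messages = 0
    frontier = [(source, None)]      # (peer, neighbor it first heard the query from)
    for _ in range(ttl):
        nxt = []
        for peer, sender in frontier:
            for nbr in sorted(adj[peer]):
                if nbr == sender:    # a query is never sent back to its sender
                    continue
                messages += 1
                if nbr not in seen:  # first copy: accept it and forward on the next hop
                    seen.add(nbr)
                    nxt.append((nbr, peer))
        frontier = nxt
    return len(seen) - 1, messages


def hpc5_allows(adj, a, b):
    """Hypothetical helper: may peer `a` add `b` without closing a 3- or 4-length cycle?"""
    if a == b or b in adj[a]:
        return False
    second = {w for u in adj[a] for w in adj[u]} - {a}   # 2nd neighbors of a
    if b in second:                  # b is a 2nd neighbor: would close a 3-length cycle
        return False
    if second & adj[b]:              # shared peer between a's 2nd and b's 1st neighbors: 4-length cycle
        return False
    return True


def ring(n):
    """Undirected ring of n peers as an adjacency dict of sets (toy topology)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


# TTL(2) flooding on cycles of length 3, 4 and 5.
for n in (3, 4, 5):
    reached, sent = flood(ring(n), 0, ttl=2)
    print(f"{n}-cycle: reached={reached} messages={sent} redundant={sent - reached}")

# A five-peer path 1-2-3-4-5: which peers may peer 1 add as a neighbor?
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print([p for p in (2, 3, 4, 5) if hpc5_allows(path, 1, p)])
```

On these toy graphs, TTL(2) flooding produces two redundant messages on the 3-cycle, one on the 4-cycle, and none on the 5-cycle. On the five-peer path, peer 5 is the only candidate that peer 1 may add, and adding it closes a cycle of length 5.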

Computer Networks 54 (2010) 1440–1459

HPC5: An efficient topology generation mechanism for Gnutella networks

Joydeep Chandra *, Santosh Kumar Shaw, Niloy Ganguly
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721 602, India

* Corresponding author. Tel.: +91 9434305806. E-mail addresses: [email protected] (J. Chandra), santosh.[email protected] (S.K. Shaw), [email protected] (N. Ganguly).

Article history: Received 15 April 2009; received in revised form 10 November 2009; accepted 24 November 2009; available online 11 December 2009.
Responsible Editor: I.F. Akyildiz
Keywords: Peer-to-peer networks; Gnutella; Topology generation

Abstract

In this paper, we propose a completely distributed topology generation mechanism named HPC5 for Gnutella networks. A Gnutella topology is efficient and scalable if it generates few redundant queries. This can be achieved if it contains few short cycles. Based on this principle, our protocol directs each peer to select neighbors in such a way that any cyclic path present in the overlay network will not generate any redundant query. We show that our approach can be deployed into the existing Gnutella network without disturbing any of its parameters. We also show that the probability of inconsistencies arising during topology generation with our mechanism, which may lead to the formation of a small number of short cycles, is very low. Nevertheless, we also propose an inconsistency handling protocol that detects such short cycles and effectively removes them. We implemented a Gnutella prototype to compare and validate the efficiency of our protocol over existing Gnutella. Simulation results indicate that our mechanism outperforms existing Gnutella in terms of network coverage (the number of unique peers explored during query propagation in limited flooding) and message complexity. Structural analysis indicates that the proposed enhancement conserves the robustness of the existing Gnutella network. Finally, we compare the proposed protocol with a state-of-the-art topology optimization protocol named Distributed Cycle Minimization Protocol (DCMP); the simulation results indicate that HPC5 outperforms DCMP in terms of message overhead and network coverage.

© 2009 Elsevier B.V. All rights reserved.
1389-1286/$ - see front matter. doi:10.1016/j.comnet.2009.11.017

1. Introduction

A peer-to-peer (p2p) network is an overlay network, useful for many purposes like file-sharing, distributed computation, etc. Depending upon the topology formation, p2p networks are broadly classified as structured and unstructured. An unstructured p2p network is formed when the overlay links are established arbitrarily. Decentralized (fully distributed control), unstructured p2p networks (Gnutella, FastTrack, etc.) are the most popular file-sharing overlay networks. The absence of a structure and central control makes such systems much more robust and highly self-healing compared to the structured systems [11,17]. But the main problem of these kinds of networks is scalability, due to the generation of a large number of redundant messages during query search. Consequently, as these networks are becoming more popular, the quality of service is degrading rapidly [8,12].

To make the network scalable, Gnutella [1–3] is continuously upgrading its features and introducing new concepts. All these improvements can be categorized into two broad areas: improvement in search techniques and modification of the topological structure of the overlay network to enhance search efficiency. In enhanced search techniques, several improvements like Time-To-Live (TTL), dynamic query, query-caching and Query Routing Protocol (QRP) have been introduced. One of the most significant topological modifications in unstructured networks was the introduction of the super-peer (ultra-peer) concept with a two-tier network topology.

The basic search mechanism followed by Gnutella is limited flooding [1–3]. In flooding, a peer that searches for a file issues a query and sends it to all of its neighbor peers. A peer that receives the query forwards it to all its neighbors except the neighbor from which it was received. In this way, a query is propagated up to a predefined number of hops (TTL) from the source peer. Recent versions of Gnutella use the dynamic querying protocol [5], where the TTL followed is generally 1 or 2 for popular searches; however, a query with TTL(3) (the numeric value inside the parentheses represents the number of hops to search with) is initiated for rare searches.

The main goal of this paper is to improve the scalability of the Gnutella network by reducing the redundant messages. One of the ways to achieve this is to modify the overlay network so that small loops are eliminated from the overlay topology. The rationale behind the proposition is explained through Fig. 1. In this figure, both networks have the same number of connections. With a TTL(2) flooding, the network in Fig. 1a discovers four peers at the expense of seven messages, whereas the network in Fig. 1b discovers six peers without any redundant messages. This happens due to the absence of any 3-length cycle in the network of Fig. 1b. On generalizing, we can say that for a TTL(r) flooding, networks devoid of cycles of length less than $(2r + 1)$ do not generate any redundant messages. We refer to networks which do not have any cycles up to length $(r - 1)$ as cycle-r networks (footnote 1). In this paper, we propose a handshake protocol that generates a cycle-5 network topology. This topology will thus rarely produce any redundant message while performing a normal TTL(2) search. The strength of our proposed mechanism is its simplicity and the ease of deployment over existing Gnutella networks, along with its power to generate topologies having high efficiency in terms of message complexity and network coverage [18].

Footnote 1: A formal proof of the statement for $r = 2$ is stated in Appendix B.

[Fig. 1. Effect of topology structure on limited flood based search, panels (a) and (b). The number inside the circle represents the TTL value required to reach that node from start node S.]

1.1. Related work

P2P traffic has grown immensely in recent years and constitutes a large portion of the total Internet traffic. Hence developing suitable mechanisms to control excessive traffic, thus reducing the burden on ISPs while maintaining an adequate search performance, has become an important research issue. We discuss certain proposed mechanisms that broadly attempt to modify the topology in unstructured p2p networks to solve the excessive traffic problem and improve search performance.

Liu et al. [9,10] observed that the structural mismatch between the overlay and the underlying network topology is a major cause of traffic redundancy in p2p networks. Peers forming blind overlay connections without knowledge of the underlying physical network can generate huge redundant traffic between autonomous systems. They proposed a location aware topology matching algorithm (LTM) to handle this problem. The basic objective of LTM is to detect the peers within the same autonomous system on the basis of the delay incurred by the messages, and then reorganize the overlay links so that peers within the same autonomous system are grouped together. LTM uses three fundamental operations to achieve this objective.

(a) TTL2-Detector Flooding: Each peer periodically floods a special message, named the TTL2-detector message. The source peer of this message stamps its IP address and the time-stamp; a neighbor on receiving this message appends its IP address and the TTL1 time-stamp and forwards it to its neighbors. Peers receiving the detector message can determine the connection speeds from the message time-stamps. If a peer P receives multiple detector messages from the same source, it initiates the second operation, named slow connection cutting.
(b) Slow Connection Cutting: Peer P, on receiving duplicate detector messages from a source node, determines the slowest connection link from the message time-stamps. If the slowest link is a link of node P, the link is disconnected.
(c) Source Peer Probing: A peer P, on receiving a message with TTL = 0 from source S through peer N, probes the connection cost of the link SP and also obtains the connection cost of links SN and NP from the message. If link SP does not have the highest cost, the link SP is created and NP is disconnected.

Although this algorithm improves search efficiency, the traffic generated by the detector messages is enormously high; with $n$ peers, the traffic incurred is $O(n^2)$. A similar class of overlay topology based on the distance between a node and its neighbors in the physical network structure is presented in [13].

Papadakis et al. [15] presented an algorithm to monitor the ratio of duplicate messages through each network connection. Each peer in the network monitors the number of duplicates received through each of its links, over the total number of received messages. On receiving a duplicate message through a link, a peer P sends feedback to the upstream node, say Q, of the link. The upstream peer Q maintains a record of the number of duplicate messages sent by itself to the downstream node. Consequently, when the ratio exceeds a certain threshold value, the node does not forward any query through that connection. However, this mechanism suffers from certain inherent drawbacks. Although a peer can receive several duplicate messages through a link from a particular source node, that link might be the only path to several other source nodes. Disconnecting that link reduces the network coverage of the peers. Moreover, this method is not suitable in case of high peer churn, where peers arrive and depart frequently.

Zhu et al. [25] very recently presented a Distributed Cycle Minimization Protocol (DCMP) to improve the scalability of Gnutella-like networks by reducing redundant messages. They have pointed out the same concept of elimination of short-length cycles. However, their approach is demand driven and involves a lot of control overhead. The algorithm does not preserve the Gnutella parameters (like degree distribution, average peer distance, diameter, etc.), hence robustness of the evolved network is not maintained. In our work, we take into consideration all the aspects like control overhead, network coverage, and robustness, and propose a holistic yet simple approach to topology formation. The algorithm is initiated as soon as a peer enters the network, rather than being demand driven. Thus the algorithm works during the bootstrap phase when the network is forming, so that less overhead is involved afterwards. It further eliminates the inconsistencies that might occur during the bootstrap phase. We later compare the performance of our algorithm with DCMP, in terms of message overhead and network coverage, using simulations. It is observed that for large network sizes, our algorithm outperforms DCMP in both these aspects.

Yet another class of algorithms works on the lines of community formation depending upon similar file interests. In these algorithms, the network topology is restricted to certain clusters, where intra-cluster traffic can be high but inter-cluster traffic is very low.

Chen et al. [6] proposed a distributed technique to identify peers sharing similar file interests and form overlay connections among them. In this technique, each file $f$ is identified by a set of attributes, say $a_1, a_2, \ldots, a_n$, based on its specific application domain. For a pair of files, say $f_i = a_1^{(i)}, a_2^{(i)}, \ldots, a_n^{(i)}$ and $f_j = a_1^{(j)}, a_2^{(j)}, \ldots, a_n^{(j)}$, a feature function $F(a_1^{(i)}, a_1^{(j)})$ measures the correlation between the attributes $a_1^{(i)}$ and $a_1^{(j)}$, i.e. whether the two files $f_i$ and $f_j$ are related through these attributes. This correlation is used to measure a conditional probability, $\Pr(f_j \mid f_i)$, that a peer will request a file $f_j$ given that it had earlier requested a file $f_i$. When a new query for a file $f$ arrives at a peer $p$, it returns $f$ if the file is present. Otherwise, it forwards the query to a new peer $q$ that had earlier requested a file $f_e$ that has the highest conditional probability, $\Pr(f \mid f_e)$, among all other file requests. The idea is that, since $q$ has requested file $f_e$, the probability that it has queried for the file $f$ and obtained it is very high. However, there are two major drawbacks of this approach: finding an appropriate feature function is difficult, and a large amount of file-sharing information needs to be disseminated, which induces high control overhead.

1.2. Organization of the paper

A model of our network environment and the basic handshake protocol is presented in Section 2. In Section 3 we describe our handshake protocol. The problems and compatibility of implementing our protocol in Gnutella networks are described in Section 4. In Section 5, we discuss the effects of inconsistency on HPC5 and state suitable algorithms to reduce its effects. We evaluate the performance and analyze the robustness of the evolved networks through simulation in Section 7. Moreover, we also draw comparisons of our algorithm with DCMP. In Section 8 we conclude and draw directions for future work.

A list of the main notations that will be used throughout the paper is summarized in Table 1 for ready reference.

Table 1. Notations.

Notation                    Meaning
TTL(r)                      Query search with TTL = r
Cycle-r network             A network which does not have any cycle up to length (r - 1)
Cycle-3 network             Gnutella network
$N$                         Total number of peers in the network
$U$                         Total number of ultra-peers in the network
$L$                         Total number of leaf-peers in the network
$d_{uu}$                    Avg. no. of ultra-neighbors of an ultra-peer
$d_{ul}$                    Avg. no. of leaf-neighbors of an ultra-peer
$d_{lu}$                    Avg. no. of ultra-neighbors of a leaf-peer
$H_k$                       Hit ratio to select kth ultra-neighbor
$\langle H \rangle$         Average hit ratio of a peer
$\langle H_{ev} \rangle$    Average evolved hit ratio of a peer
rth neighbor                A peer at a distance of r hops. All immediate neighbors are 1st neighbors, all neighbors of 1st neighbors are 2nd neighbors and so on

2. Basic system model of Gnutella 0.6

In order to carry out experiments, a basic version of Gnutella 0.6 [1–3] has been implemented. The basic Gnutella consists of a large collection of nodes that are assigned unique identifiers and which communicate through message exchanges.

2.1. Topology

Gnutella 0.6 is a two-tier overlay network, consisting of two types of nodes: ultra-peers and leaf-peers (the term peer represents both ultra- and leaf-peers). An ultra-peer is connected with a limited number of other ultra-peers and leaf-peers. A leaf-peer is connected with some ultra-peers. However, there is no direct connection between any two leaf-peers in the overlay network. Yet another type of peer is the legacy-peer, which is present in the ultra-peer level and does not accept any leaves. Since legacy-peers are negligible in the network (they constitute approximately 0.3% of the total peers [24]), we do not consider legacy-peers in our further analysis.

2.2. Basic search technique

The network follows a limited flood based query search. A query of an ultra-peer is forwarded to its leaf-peers with TTL(0) and to all its ultra-neighbors with one less TTL only when TTL > 0. A leaf-peer does not forward a query received from an ultra-peer. On the other hand, ultra-peers perform query searching on behalf of their leaf-peers. The query of a leaf-peer is initially sent to its connected ultra-peers. All the connected ultra-peers simultaneously forward the query to their neighbor ultra-peers up to a limited number of hops. Since multiple ultra-peers are initiating flooding, a leaf-peer's query will produce more redundant messages if the distance between any two ultra-neighbors is not large enough. Gnutella 0.6 incorporates dynamic querying [5] over limited flooding as its query search technique. In dynamic querying, an ultra-peer incrementally forwards a query in three steps (TTL(1), TTL(2), TTL(3), respectively) through each connection after measuring the responsiveness in each step. The ultra-peer can stop forwarding a query at any step if it gets a sufficient number of query hits. Consequently, dynamic querying uses TTL(3) only for rare searches. The modern Gnutella protocol uses the QRP technique over dynamic querying, in which a leaf-peer creates a hash table of all the files it is sharing and sends that table to all the immediate ultra-neighbors. As a result, when a query reaches an ultra-peer it is forwarded to only those connected leaf-peers which would have query hits [1,2].

2.3. Basic handshake protocol

Many software applications (clients) are used to access the Gnutella network (like Limewire, Bearshare, Gtk-gnutella). The handshake protocol of the most popular client software, Limewire, is used in our simulation as the base handshake protocol. Through handshaking, a peer establishes a connection with any other ultra-peer. To start a handshake protocol, a peer first collects the address of an online ultra-peer from a pool of online ultra-peers. A peer can collect the list of online peers from GWebCache systems [4] and/or through pong-caching and/or from its own hard-disk, which has obtained a list of online ultra-peers in the previous run [8]. The handshake protocol is used to make new connections. A handshake consists of three groups of headers [1,2]. The steps of handshaking are elaborated next:

1. The program (peer) that initiates the connection sends the first group of headers, which tells the remote program about its features and the status to imply the type of neighbor (leaf or ultra) it wants to be.
2. The program that receives the connection responds with a second group of headers, which essentially conveys whether it agrees to the initiator's proposal or not.
3. Finally, the initiator sends a third group of headers to confirm and establish the connection.

This basic protocol is modified in this paper to overcome the problem of message overhead.

2.4. Simulated Gnutella

To generate an existing Gnutella network, we have simulated a stripped-down version of the Gnutella 0.6 protocol which follows the parameters of Limewire [1]. Our simulated Gnutella network exhibits all features (like degree distribution, diameter, average path length between two peers, proportion of ultra-peers, etc.) exhibited by the Gnutella network. These features are obtained from the snapshots collected by crawlers [3,19,22]. The properties of the Gnutella [1,22] and simulated Gnutella networks are given in Table 2.

Table 2. Properties of Gnutella and simulated Gnutella network.

Property                           Gnutella                Simulated Gnutella
No. of peers                       2000k                   100k
Ultra-peer ratio                   15–16% of total peers   15% of total peers
Avg. diameter of ultra-layer       6–7                     4–5
Maximum connections                Ultra–ultra 32          Ultra–ultra 32
(applicable to Limewire)           Ultra-leaf 30           Ultra-leaf 30
                                   Leaf-ultra 3            Leaf-ultra 3
Average connections                $d_{uu}$: 25–26         $d_{uu}$: 22–23
                                   $d_{ul}$: 20–22         $d_{ul}$: 17–18
                                   $d_{lu}$: 3–4           $d_{lu}$: 3

We next highlight the functional features of our simulated Gnutella.

1. The simulator is a multi-threaded parallel execution simulator in which the Gnutella processes running at each node are implemented using separate threads. Each peer arrival spawns a new thread that follows the required bootstrap protocol to form neighbors.
2. The network grows from an initial set of 20 pre-existing ultra-peers that are connected in the form of a one-dimensional lattice of degree 2 and form the initial topology. Each of these peers initially constitutes the GWebCache system.
3. As peers join, the GWebCache system is updated as some proportion of ultra-peer ids are recorded in the GWebCaches. The entries in the GWebCache system provide a gateway to the Gnutella network for new incoming nodes, as new nodes can get some initial peer addresses to connect to.
4. A certain number of peers arrive in bursts (we have used a burst size of 25,000); arriving peers are randomly assigned an ultra- or leaf-peer status in the ratio stated in Table 2. These peers randomly connect to any one of the existing ultra-peers, the information of which is retrieved from a random peer in the GWebCache system.
5. Peers deploy the pong-caching [5] mechanism to maintain information about other known peers; peers then use a gossip based mechanism, named the ping-pong mechanism, to obtain these cached peer addresses and subsequently connect to them until the maximum neighbor count is reached.
6. Nodes send connection requests to a random set of peers, whose addresses they have obtained using the ping-pong mechanism. Connection requests are sent sequentially, in a non-concurrent manner. We define the time difference between the sending of two consecutive connection requests by a peer as a step. Thus, at each step, a peer sends a connection request to exactly one peer.
7. To study the effect of node removal on the Gnutella network, we also implemented a node removal functionality in which a desired number of nodes is randomly removed from the network. However, since this reduces the degree of the existing peers, peers attempt to regain their lost degree by reconnecting with other peers. To simulate this effect, we assume that at each step, a peer that has lost a neighbor sends a connection request to exactly one peer. If the connection request is from a valid peer, i.e. the node can be accepted as a neighbor, then the connection request is accepted. At the end of a step, all peers connect to valid peers from whom a connection request has arrived. We assume that at each step, a peer that has lost an edge connects to at least one new node; hence if it has lost D degrees, then it requires a maximum of D steps to restore its lost degree.

The peer selection strategy during bootstrapping can play an important role in the search performance. Several gossip based strategies exist for random peer sampling [7,21,23]; however, to maintain parity, we implement the pong-caching mechanism that is actually proposed in Gnutella. We ran our simulator on a server having a DP Intel Dual Core Xeon 3.2 GHz 5060 processor with 4GB DDR2 RAM, and could simulate a network size of 1 million nodes. In the next section we discuss the proposed HPC5 mechanism in detail.

3. HPC5: Handshake protocol for cycle-5 networks

Fig. 2 illustrates the proposed HPC5 graphically. In Fig. 2, peer-1 requests other online ultra-peers to be its neighbor, given that peer-2 is already a neighbor of peer-1. In Fig. 2a and b, the possibility of the formation of a triangle and a quadrilateral arises if a 1st or 2nd neighbor of peer-2 is selected. However, this possibility is discarded in Fig. 2c and a cycle of length 5 is formed.

[Fig. 2. Selection of neighbor by peer-1 after making peer-2 as a neighbor. Panels (a), (b), (c), each showing peers 1 to 5.]

Each peer maintains a list of its 1st and 2nd neighbors, which contains only ultra-peers (because a peer only sends a request to an ultra-peer to make a neighbor). The 2nd ultra-neighbors of a leaf-peer thus represent the collection of neighboring ultra-peers of its adjacent ultra-peers. To keep this knowledge updated, each ultra-peer periodically exchanges its list of 1st neighbors with its neighbor ultra-peers and sends the list of 1st neighbors to its leaf-peers. To do this with minimal overhead, a piggyback technique can be used in which an ultra-peer appends its neighbor list to the messages passing through it.

The two steps of the modified handshake protocol (HPC5) are described below.

1. The initiator peer first sends a request to a remote ultra-peer which is not in its 1st or 2nd neighbor set. The request header contains the type of the initiator peer. The presence of the remote peer in the 2nd neighbor set implies the possibility of a 3-length cycle. In Fig. 2, peer-1 cannot send a request to peer 2 or 3; on the other hand, peers 4 and 5 are eligible remote ultra-peers.
2. The recipient replies with its list of 1st neighbors and the neighborhood acceptance/rejection message. If the recipient peer rejects the connection in this step, the initiator closes the connection and keeps the record of neighbors of the remote peer for future handshaking. On acceptance of the invitation by the recipient peer, the initiator checks if at least one common peer exists between its 2nd neighbor set (say, A) and the 1st neighbor set of the remote peer (say, B). A common ultra-peer between sets A and B indicates the possibility of a 4-length cycle.
   If no common peer is present between sets A and B, then the initiator sends accept connection to the remote peer.
   Otherwise the initiator sends reject connection to the remote peer.

HPC5 prevents the possibility of forming a cycle of length 3 or 4 and generates a cycle-5 network.

4. Hurdles in implementing the scheme

Before embedding the simple HPC5 in the Gnutella network, several important questions have to be taken into consideration to assess its viability. The most important of them are listed below.

1. Is this scheme compatible with the current population of the Gnutella network?
2. On average, how many trials are required to get an ultra-neighbor?
3. Is there any possibility of an inconsistency and if so, how can such inconsistency be removed?

Each of the questions is discussed one by one.

4.1. Compatibility with the current population of Gnutella

From the ultra-peer point of view, the total number of ultra-leaf connections is $U \cdot d_{ul}$ and from the leaf-peer point of view it is $L \cdot d_{lu}$. By equating both, we get

$$U \cdot d_{ul} = L \cdot d_{lu}$$
$$U \cdot d_{ul} = (N - U) \cdot d_{lu}$$
$$N = U \cdot \frac{d_{ul} + d_{lu}}{d_{lu}} \qquad (1)$$

Fig. 3 represents a part of an ultra-peer layer where P has immediate neighbors at level Q. Suppose P is already connected with $(d_{uu} - 1)$ ultra-peers at level Q and wants to get its $d_{uu}$-th ultra-neighbor. According to HPC5, P should not connect to any ultra-peer from level R or S as its next neighbor. However, T can be a neighbor of P. Thus, we can say that if P wants to make a new ultra-neighbor then P has to exclude at most $(d_{uu} - 1)$, $(d_{uu} - 1)^2$ and $(d_{uu} - 1)^3$ ultra-peers from levels Q, R and S, respectively. So, a total of $[(d_{uu} - 1) + (d_{uu} - 1)^2 + (d_{uu} - 1)^3] \approx d_{uu}^3$ peers cannot be considered as next neighbor(s) of P. Therefore the number of ultra-peers in the network needs to be at least

$$U \geq d_{uu}^3 \qquad (2)$$

[Fig. 3. A part of an ultra-peer layer (levels P, Q, R, S, T), where a node represents all nodes that are present in that level. Like, Q represents all 1st neighbors of P.]

From Eqs. (1) and (2) we get

$$N \geq \frac{(d_{ul} + d_{lu}) \cdot d_{uu}^3}{d_{lu}} \qquad (3)$$

Presently the Gnutella network has a population of almost 2000k peers at any time [1]. From Eq. (3) it can be seen that for the present values of $d_{uu}$, $d_{ul}$ and $d_{lu}$ (Table 2), 120–130k peers are sufficient to implement the HPC5 protocol. However, to form cycle-6 networks (HPC6) the number of peers required,

$$N \geq \frac{(d_{ul} + d_{lu}) \cdot d_{uu}^4}{d_{lu}},$$

is more than 2000k. Hence the current population will not be able to support any such attempts.

4.2. Hit ratio

Hit ratio is defined as the inverse of the number of trials required to get a valid ultra-peer neighbor. As our protocol puts some constraints on neighbor selection, a contacted agreeing remote ultra-peer may not be selected as neighbor. Mathematically, if on average a peer (say, P) is looking for its kth ultra-neighbor and only the $m_k$-th contacted ultra-peer satisfies the constraints and becomes the kth ultra-neighbor of P, then the hit ratio for the kth neighbor will be $H_k = 1/m_k$. We first make a static analysis of the hit ratio, then fine tune it considering that the network is evolving.

At the time of the kth ultra-neighbor selection in HPC5, a peer (say, P) does not consider its 1st ($(k - 1)$ ultra-peers) and 2nd ($(k - 1)(d_{uu} - 1)$ ultra-peers) ultra-neighbors as potential neighbors, and this exclusion is done locally by checking the 1st and 2nd neighbor lists of P. The number of ultra-peers excluded ($U'$) is then $[(k - 1) + (k - 1)(d_{uu} - 1)]$. According to step-3 of HPC5, P cannot make a neighbor of any ultra-peer of level S (3rd ultra-neighbors of P) of Fig. 3, which are $U'' = [(k - 1)(d_{uu} - 1)^2]$ in number. Therefore, a total of $[U' + U'']$ ultra-peers are excluded. So the hit ratio can be given as

$$H_k = \frac{U - (U' + U'')}{U - U'}$$

Assuming $U' \ll U$, $U' \ll U''$ and $U'' \approx d_{uu}^2 \cdot (k - 1)$, $H_k$ becomes

$$H_k \approx \frac{U - d_{uu}^2 \cdot (k - 1)}{U} \qquad (4)$$

The upper bound of $k$, and consequently the average ultra-degree, differs between leaf-peers and ultra-peers. To generalize further calculations, let $m$ be the average ultra-degree of a peer. So, the average hit ratio is

$$\langle H \rangle = \frac{1}{m} \sum_{k=1}^{m} H_k = \frac{1}{m} \sum_{k=1}^{m} \left[ 1 - \frac{d_{uu}^2 \cdot (k - 1)}{U} \right] = 1 - \frac{d_{uu}^2 \cdot (m - 1)}{2U} \qquad (5)$$

Eq. (5) shows the average hit ratio of a peer joining the network when the population of ultra-peers in the network is $U$. It also reflects that $(1 - \langle H \rangle)$ is inversely proportional to the number of ultra-peers ($U$) in the complete network.

Now as each node joins, the network grows. As a result the average hit ratio changes with the network growth. Therefore the evolved hit ratio is the average value of all average hit ratios which are calculated at each growing stage of the network. Let $U_0$ and $U_n$ be the number of ultra-peers in the initial and final networks. So, the evolved hit ratio is

$$\langle H_{ev} \rangle = \frac{1}{U_n - U_0} \sum_{U_i = U_0}^{U_n} \langle H \rangle = 1 - \frac{d_{uu}^2 \cdot (m - 1)}{2 \cdot (U_n - U_0)} \sum_{U_i = U_0}^{U_n} \frac{1}{U_i} \qquad (6)$$

Now,

$$\sum_{U_i = U_0}^{U_n} \frac{1}{U_i} \approx \log\left(\frac{U_n}{U_0}\right)$$

Therefore, Eq. (6) becomes

$$\langle H_{ev} \rangle \approx 1 - \frac{d_{uu}^2 \cdot (m - 1)}{2 \cdot (U_n - U_0)} \cdot \log(U_n / U_0) \qquad (7)$$

From Eqs. (5) and (7) we get

$$\langle H_{ev} \rangle \leq \langle H \rangle$$

As $d$ and the maximum value of $m$ are bounded, the value of $\langle H_{ev} \rangle$ increases with $U$. We have tested this phenomenon through our simulation, plotted the evolved hit ratio against the network size varying from 200k to 1000k in Fig. 4, and observed the similarity between them. The similarity is not pronounced in the beginning, as the approximations made to develop Eqs. (4) and (7) are significant in smaller networks. However, as the number of peers increases, both $\langle H \rangle$ and $\langle H_{ev} \rangle$ converge to the same value.

[Fig. 4. Hit ratio of a peer against the number of peers in the network. Plot title: Empirical and Theoretical Hit Ratio of a peer; curves: theoretical evolved hit ratio and empirical evolved hit ratio; x-axis: Number of Peers; y-axis: Hit Ratio.]

5. Consistency problems in HPC5

For HPC5 to work correctly, each peer must send correct information about its neighbors to other peers that want to connect. To facilitate this process, peers can periodically exchange the list of their neighbors, which gives them up-to-date information about their neighbors. However, this leads to a situation where a parallel update might occur: in between two successive updates, a peer may have erroneous knowledge about its neighbors. As a result, this inconsistency of the network leads to the presence of 3-length or 4-length cycles. Parallel update is possible when many peers enter simultaneously, like during the bootstrap phase [20], or when there is a huge failure/attack in the network whereby many nodes have lost their neighbors and would now like to quickly gain some.

In parallel update, due to inconsistency, short cycles are formed as multiple peers in the same cycle handshake in parallel with a third common ultra-peer and become each other's neighbors. An instance of a parallel update situation is illustrated through an example (Fig. 5a) where peer-1 and peer-5 execute the following actions according to the steps of HPC5 and form cycle-3.

1. Both peer-1 and peer-5 find that peer-P is a valid remote peer to contact and both send a request to P.
2. Peer-P gets their requests more-or-less at the same time and sends back the neighborhood status to them.
3.

Similarly, smaller cycles may be created when multiple peers contact each other as a directed cycle (as in Fig. 5b) within the period of two successive updates.

In the next section, we provide a detailed analysis of the topological structure arising due to node failure; however, the analysis will be analogous for inconsistencies arising due to bursty peer arrival. We analytically derive the average number of short cycles formed in the ultra-peer level due to inconsistencies arising from node failure and validate our results using simulation. We then propose an algorithm to handle these inconsistencies, supported by extensive simulation results to validate the efficiency of our algorithm.

5.1. Inconsistency arising in the face of failure/attack

Here, we discuss the topological status of the network when a fraction $x$ (where $x \ll 1$) of the nodes leave the network. Our analysis deals with the topological structure at the ultra-peer level; this is because, with QRP, search queries are not forwarded to the leaf-peers, hence the number of short cycles at the ultra-peer level affects the traffic redundancy in the network. Thus, in all our derivations in this section, the term peer refers to an ultra-peer in the network. We derive the distribution of the number of 3- and 4-length cycles formed around an ultra-peer when a small fraction, $x$, of the peers leave the network. Then we go on to derive the total number of such cycles formed in the network. We provide simulation results to test and compare the validity of our models. We assume that the nodes have left uniformly from different parts of the network. So each peer loses a fraction of its neighbors and in effect the average degree of a peer in the network becomes less. To maintain the degree distribution of the network, each peer contacts other remote ultra-peers to make up its neighbor deficiency. During this process, 3-length and 4-length cycles are created temporarily due to inconsistency between two successive updates. As defined earlier, we consider the time difference between sending two consecutive connection requests, by a peer to other peers, as a step; we assume that at each step a peer sends a connection request to exactly one other peer, however, it receives connection requests from many peers. After the removal process, $U_{rem} = (U - xU)$ ultra-peers remain in the network. Since we have assumed that the peers have been removed uniformly throughout the network, the average number of ultra-neighbors of an ultra-peer will get reduced to $\hat{d}_{uu} \approx d_{uu}(1 - x)$. To restore the ultra-neigh-
As peer-1 and peer-5 do not know each other’s activity bor count, each ultra-peer tries to connect to new ultra- or updated status, they make P as their new neighbor, peers. We assume that, on average, each peer connects to therefore a cycle-3 is formed due to this inconsistency. xduu new ultra-peers so as to restore the average number of ultra-neighbors to duu . According to HPC5, a peer (say 2 1 peer-1) cannot make any ultra-peers at level Q, R or S in 1 2 Fig. 3 as its neighbor. With an average ultra-neighbor of 3 ^uu , the number of ultra-peers from which a peer has to P d 5 4 3 choose a neighbor is (a) (b) ^uu þ d ^uu ðd ^uu  1Þ þ d ^uu ðd ^uu  1Þ2 ; Nrem ¼ U rem  ½d Fig. 5. A part of cycle-5 network, representing parallel update ^3 þ d ^2  d ^uu : inconsistency. ¼ U rem  duu uu J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 1447 We define this set from which a peer can choose a neighbor the same peer. Two combinations are considered as as the potential neighbor set of the peer. Hence N rem repre- isomorphic with respect to peer-1 (Fig. 6a), if removing sents the size of the potential neighbor set of a peer; the the labels of the other peers, (i.e. peer-5 and P in this case), potential neighbor set of every peer can be assumed to makes both the graphs similar to each other. Thus a 3- be approximately same in the limit of large N rem . We next length cycle can be formed due to three major cases. find the average number of new connections made by a node at each step, i.e. the average degree that is restored. 1. Each of the three peers, peer-1, peer-5 and P sends con- At each step, a connection is made to peer-1 if peer-1 sends nection request to each other. Thus we have 23 ¼ 8 a connection request and finds a valid peer, or any valid combinations of directed edges, out of which six possi- peer sends a connection request to peer-1. 
The probability ble combinations are invalid and the rest two are iso- that peer-1 finds a valid peer at a step is NUrem rem and the prob- morphic cases; hence there is one possible way in ability that any valid peer connects to peer-1 is which a 3-length cycle can be formed due to this case. 1 N rem  Nrem  NU rem rem ¼ NUrem rem . Thus the average number of connec- This case is shown in Fig. 6a. Some of the invalid cases tions formed by peer-1 at each step is C ¼ 2 NUrem rem and the are shown in Fig. 6b. average number of steps required to restore the average 2. There is an existing edge between peer-5 and P, and is approximately these two nodes simultaneously connect to peer-1. The number of such combinations of directed edges xduu U rem xduu are 4, out of which one is invalid and two are isomor- S¼ ¼ ð8Þ C 2Nrem phic. Thus a cycle can be formed due to this case in two possible ways (Ref. Case-2 in Fig. 7a). We now derive the distribution of the number of 3 and 4- 3. An edge exists between peer-1 and either P or peer-5. length cycles formed around a peer at each step and then Thus 3-length cycles are possible due to four possible find the average number of cycles formed in the network. combinations of directed edges out of which one is It is similar to finding certain prohibitive network motifs invalid, leading to three possible ways (Ref. Case-3 in [14] that might arise due to inconsistent updates. Fig. 7a). 5.2. 3-Length cycle We now find the probability that a 3-length cycle forms around peer-1 for each of the above three cases. A 3-length cycle is created if three peers get involved in Case 1. In case 1, the peers connect to each other in a HPC5 as in Fig. 5. The initiation of handshake protocol in cyclic order. The probability that peer-1 connects to a ran- different combinations among three peers may create a dom node (say peer-5), which in turn connects to another triangle. 
We attempt to model the average number of random peer (say P) that connects back to peer-1 is 3-length cycles formed around a peer, and the average approximately number of such cycles in the whole network. Considering 3 3 peers named, peer-1, peer-5 and P, as shown in Fig. 5a, 1 1  ð3Þ at each step, a 3-length cycle can be formed around peer- A1 ¼ ðN rem ÞðNrem  1Þ  ð9Þ Nrem Nrem 1 by various combination of connection requests made by each of the three peers. If we assume that no two peers Case 2. In case 2, the probability that two given peers, simultaneously send connection requests to each other at which are themselves connected,  2both can connect to 1 any step, the various ways of forming a cycle among the peer-1 in any possible way is Nrem . Since each peer has three peers can be represented by connecting the peers ^ an average degree of duu , the average number of edges ^ using directed edges as shown in Fig. 5. shared between N rem peers is Nrem2duu , which is the possible A directed edge from peer-1 to peer-5 implies that peer- number of neighboring peers that can simultaneously at- 1 has sent a connection request to peer-5. We consider a tempt to connect to peer-1 forming 3-length cycles. combination of directed edges as invalid (Fig. 6b), if one Assuming, the probability that more than one pair of adja- peer sends connection requests to two different nodes at cent peers connects to peer-1 is negligible, the probability one single step, i.e. no two directed edges go out from that a 3-length cycle formed due to case 2 is Peer-1 Peer-1 Peer-1 Peer-1 Peer-5 P Peer-5 P Peer-5 P Peer-5 P Fig. 6. Case 1. Graphs in (a) are isomorphic with respect to peer-1, removing the labels peer-5 and P from both these graphs makes the two graphs absolutely similar. Isomorphic combinations can be considered as single combination with respect to peer-1. This combination also represent case 1 of 3- length cycle formation. 
Graphs in (b) are invalid as in first case peer-5 sends connection requests to two peers simultaneously (represented by the directed edges from peer-5), whereas in the second figure, peer-1 sends two connection requests at the same time. 1448 J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 peer- peer- peer- peer- peer- peer- 1 1 1 1 1 1 Peer Peer Peer Peer Peer Peer Peer Peer -2 -3 -2 -3 -2 -3 -2 -3 peer peer P P Peer Peer Peer Peer -5 -5 -4 -4 -4 -4 Case 2 Case-1 Case-2 Case-3 Case-4 peer- peer- peer- peer peer peer 1 1 1 -1 -1 -1 Peer Peer Peer Peer Peer Peer -2 -3 -2 -3 -2 -3 peer peer peer P P P Peer Peer Peer -5 -5 -5 -4 -4 -4 Case 3 Case-5 Case-6 Case-7 Fig. 7. Possible cases of 3 and 4 length cycles. ^ Nrem d ^ d each other in the same step is zero, thus in this case, a ð3Þ uu uu A2 ¼ 2  ð10Þ 4-length cycle can be represented by various combina- 2ðNrem Þ 2N rem tions of the four directed edges among these peers. Out There are two possible valid combinations for formation of of the 16 possible combinations, 14 combinations are a cycle due to case 2, and thus each will have an occurrence invalid, and two combinations are isomorphic to each ^ probability equal to 2Nduurem . other with respect to peer-1. Case 3. For each possible valid combination in case 3, 2. An edge exists between peer-3 and peer-4. 4-Length the probability that a random node connects to peer-1 is cycles will be formed if one of these peers connects to ^ uu 1 and also connects to any neighbor of peer-1 is Ndrem . Thus peer-1 and another peer connects to a peer, say peer- Nrem the probability of forming a 3-length cycle around peer-1 2, that in turn connects to peer-1. However, there can due to any valid combination of case 3 is be several possible combinations by which these con- nections can be made. The number of such possible ^uu d combinations are 23 ¼ 8, out of which four are invalid. ð3Þ A3 ¼ ð11Þ Thus 4-length cycles can be formed due to this case in Nrem four possible ways. 
ð3Þ ð3Þ ð3Þ Considering each of the probabilities, A1 ; A2 and A3 to 3. There is an existing edge between peer-1 and any of the be small in the limit of large N rem , the average number of other peers, say peer-2. Similar to the previous case, the 3-length cycles formed around peer-1 after it has restored number of such possible combinations are 23 ¼ 8, out of its average degree is which four are valid, leading to four possible ways of  ð3Þ ð3Þ ð3Þ  forming a 4-length cycle. Að3Þ ¼ S  A1 þ 2A2 þ 3A3 ð12Þ 4. Two edges are existing among three peers and none of and the total number of 3-length cycles in the whole net- the edge belongs to peer-1, i.e the peers – peer-2, work is given by, peer-3 and peer-4 – are already connected and nodes peer-2 and peer-4 attempt connecting to peer-1. The 1 ð3Þ number of such possible combinations are 22 ¼ 4, out L3 ¼  A  U rem ð13Þ 3 of which one is invalid and two are isomorphic. Thus a 4-length cycle can be formed due to this case in two 5.3. 4-Length cycle possible ways. 5. Two edges are existing, both of which are from peer-1 Similar to the case of 3-length cycle, formation of a 4- that connects to peers, say peer-2 and peer-3. A 4-length length cycle around a peer involves connections among 4 cycle will be formed if both peer-2 and peer-3 connect to peers, namely, peer-1, peer-2, peer-3, and peer-4. The def- the fourth peer, i.e. peer-4 in any possible combination initions for invalid combinations and isomorphic combina- of directed edges. Similar to the previous case, the num- tions remain same as defined previously. We enumerate ber of such possible combinations are 22 ¼ 4, out of the possible ways in which a 4-length cycle can be formed which one is invalid and two are isomorphic, leading around peer-1. For each of these cases, we represent one of to two possible ways of forming a 4-length cycle. the possibilities in Fig. 7b. 6. 
Two edges are existing, one of which belongs to peer-1 and another does not, i.e an edge connects peer-1 and 1. Each of the four peers have no connection among them- another peer, say peer-2 and another edge connect selves and randomly select each other so as to form a peer-3 and peer-4. The number of possible edge combi- cycle. Similar to the case of 3-length cycles, we assume nations in which a 4-length cycle is formed is 22 ¼ 4, that the probability of two peers sending request to with no invalid or isomorphic cases. J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 1449 7. Two edges are existing, one belonging to peer-1, con- Total number of 3 and 4-length cycles necting to any of its neighbor, say peer-2, and another Model Simulations connecting peer-2 to any other peer, say peer-4. The pos- 1000 Total number of 3 and 4-length cycles sible number of edge combinations in which a 4-length cycle is formed is 22 ¼ 4, out of which one is invalid, thus 800 leading to three possible ways of cycle formation. 600 We derived the probability of forming a 4-length cycle around peer-1 for each of these cases in the limit of large N rem , as stated in Table 3. We find the probabilities of 400 occurrence of 4-length cycles due to cases 1, 2 and 3 to be significantly smaller in the limit of large N rem . Thus, 200 there are 11 major possibilities of occurrence of 4-length cycles; considering these occurrence probabilities as equal ^2 d 1 2 3 4 5 6 7 8 9 10 to 2Nrem , the number of 4-length cycles formed around a Fraction of peers removed (x) peer at each step can be modeled as a binomially distrib- uted random variable with the probability mass function Fig. 8. The total number of cycles formed in the network in HPC5 due to   11i inconsistent updates for a network with 200k total peers and 30k ultras. 11 i  ^2 d ^2 given as 1  2Ndrem . 
The average number i 2N rem the potential neighbor set of two random nodes is approx- ^2 11d of cycles thus formed at each step equals 2Nrem . However, imately same, which is true when Nrem is large. With according to Eq. (8), the number of steps required for a increasing values of x, the assumption of N rem being very peer to restore its average degree is represented as S, large, does not hold. A lower N rem value implies a large hence the average number of cycles around peer-1 after overlap of the neighbor sets of any two random peers; this it has restored its average is results in formation of more number of short-length cycles ^2 than is actually predicted by our model. Similarly, for very 11Sd Að4Þ ¼ ð14Þ low values of x, we find a small variation of the simulation 2N rem results with our model. This is due to an effect that we Thus the average number of cycles in the whole network term as the interconnection effect. In practice, the average equals degrees actually lost by the remaining peers is slightly lower than that assumed in the model. This is because, 1 11Sd ^2 11Sd ^2 L4 ¼  U rem  ¼ U rem  ð15Þ we have assumed that the nodes which leave have no con- 4 2Nrem 8Nrem nection within themselves, which is not strictly true. How- and the total number of short length cycles formed in the ever, as value of x increases, the impact of overlapping network is, neighbor set effect becomes more prominent, the balance is noticed for moderate values of x (4–6%), where simula- L ¼ L3 þ L4 ð16Þ tion results match perfectly with the model values. We simulated the HPC5 protocol to find the total number We find that the total number of short-length cycles of cycles in the network, and compared the results with formed when 10% of the peers have been removed is our model as shown in Fig. 8. The number of peers consid- around 1025, which is quite small compared to the size ered for our simulations was 200k, having nearly 30k ultra- of the network. 
Although the number of cycles formed peers. We found that, as per our assumptions, for low to due to inconsistency in HPC5 is small, however, in the next moderate values of x, the simulation results match well section, we provide an outline of detecting and removing with the model values and deviates when the values of x small length cycles arising due to inconsistency. are large. This is due to an effect that we term as the over- lapping neighbor set effect. In our model we assumed that 6. Handling consistency problems Table 3 The table states the derivation results of the probability of occurrence of a Theoretically, as seen from Eq. (16), in HPC5 the proba- 4-length cycle around a peer due to each of the above stated cases. The bility that a peer will receive duplicate queries due to third column indicates the number of ways of forming a cycle for a inconsistent updates is quite small. For a network with particular case. 200k peers and 30k ultra-peers, the simulation results Case # Prob. of occurrence # of ways show that the total number of short length cycles formed 1 1 Nrem 1 in the network, when 10% of the total peer departs, are 2 ^uu d 4 only around 1025. This implies that the number of redun- 2N rem 3 ^uu d 4 dant messages generated by any peer in HPC5 due to Nrem 4 ^2 d 2 inconsistency is quite less, and this level of inconsistency uu 2N rem can arise only if all the 10% of the nodes attempt to regain 5 ^ ðd duu ^ 1Þ uu 2 2Nrem their links simultaneously. However, to handle such rare 6 ^ d uu 2 4 flash crowd type situation, we propose an inconsistency 2N rem 7 ^2 d uu 3 handling procedure to handle these inconsistencies. Every Nrem ultra-peer maintains a record of the number of duplicate 1450 J. Chandra et al. 
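As a numerical companion to the model total of Eq. (16), the chain of Eqs. (8) to (16) can be evaluated directly. The following is a minimal sketch rather than the authors' simulator: the function name `short_cycle_estimate`, the result container and the parameter values in the usage lines are hypothetical, and the value of $d_{uu}$ is not stated in this section (it comes from Table 2), so the output is not expected to reproduce the simulated figure of about 1025 cycles.

```python
from dataclasses import dataclass


@dataclass
class CycleEstimate:
    """Intermediate and final quantities of the short-cycle model."""
    n_rem: float  # size of the potential neighbor set, N_rem
    steps: float  # steps S needed to restore the average degree, Eq. (8)
    l3: float     # expected number of 3-length cycles in the network, Eq. (13)
    l4: float     # expected number of 4-length cycles in the network, Eq. (15)

    @property
    def total(self) -> float:
        """Total number of short cycles L = L3 + L4, Eq. (16)."""
        return self.l3 + self.l4


def short_cycle_estimate(U: float, d_uu: float, x: float) -> CycleEstimate:
    """Evaluate Eqs. (8)-(16) for U ultra-peers, ultra-degree d_uu and
    a removed fraction x (0 < x < 1)."""
    u_rem = U * (1 - x)      # ultra-peers remaining after removal
    d_hat = d_uu * (1 - x)   # reduced average ultra-degree
    # Peers at the Q, R and S levels are excluded from the candidate set.
    n_rem = u_rem - (d_hat + d_hat * (d_hat - 1) + d_hat * (d_hat - 1) ** 2)
    if n_rem <= 0:
        raise ValueError("network too small: potential neighbor set is empty")

    steps = x * d_uu * u_rem / (2 * n_rem)        # Eq. (8)

    a1 = 1 / n_rem                                # Eq. (9)
    a2 = d_hat / (2 * n_rem)                      # Eq. (10)
    a3 = d_hat / n_rem                            # Eq. (11)
    a_3 = steps * (a1 + 2 * a2 + 3 * a3)          # Eq. (12)
    l3 = a_3 * u_rem / 3                          # Eq. (13)

    a_4 = 11 * steps * d_hat ** 2 / (2 * n_rem)   # Eq. (14)
    l4 = a_4 * u_rem / 4                          # Eq. (15)
    return CycleEstimate(n_rem=n_rem, steps=steps, l3=l3, l4=l4)


if __name__ == "__main__":
    # Hypothetical parameters: 30k ultra-peers, d_uu = 10, 10% of peers removed.
    est = short_cycle_estimate(U=30_000, d_uu=10, x=0.10)
    print(f"N_rem={est.n_rem:.0f}  S={est.steps:.3f}  "
          f"L3={est.l3:.2f}  L4={est.l4:.2f}  L={est.total:.2f}")
```

Because $L_4$ scales with $\hat{d}_{uu}^2$ while $L_3$ scales with $\hat{d}_{uu}$, the 4-length term dominates the total in this sketch for any realistic degree, which is consistent with the model keeping only the eleven dominant 4-length possibilities.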
The ratio of the number of duplicate queries received to the total number of queries received in a particular time interval is termed the duplicate query ratio. When the duplicate query ratio at an ultra-peer P (say $d_P$) crosses a threshold value (say $d_t$), the peer considers that inconsistency may have occurred in the network. Hence it becomes a coordinator and initiates an inconsistency handling protocol. The major objectives of the inconsistency handling protocol are to predict whether the extent of inconsistency has crossed a limit and, in that case, to detect the short length cycles that may have arisen due to inconsistency and subsequently remove them by dropping a suitable link. The inconsistency handling protocol is divided into three parts: inconsistency estimation, cycle detection and cycle removal. The algorithm is inherently distributed in nature and each node runs the algorithm separately. We discuss each of the parts below.

6.1. Inconsistency estimation

The initial task of the coordinator (i.e. the peer that realizes that its duplicate query ratio has crossed the threshold $d_t$) is to predict whether the inconsistency in the network has also crossed a particular threshold. We refer to the percentage of peers that are part of any short length cycle as the percentage inconsistency in the network. As observed from the simulation results, the average duplicate ratio of the network (say $\mu$) is highly correlated with the percentage of inconsistency in the network (refer to Appendix A). Thus, locally estimating the value of $\mu$ helps in predicting the percentage inconsistency in the entire network. Since it is difficult for the coordinator to obtain the average duplicate ratio of the whole network, $\mu$, the coordinator predicts $\mu$ from the duplicate query ratios of the neighboring peers. To obtain this estimate, the peer queries the neighbors about their duplicate query ratio, calculated over a specific interval of time, by sending a query message. The neighboring peers respond by sending their individual duplicate query ratios, and subsequently an estimated average duplicate ratio, $\bar{x}$, is calculated. Using $\bar{x}$, the average duplicate ratio, $\mu$, is tested with a given level of significance, $\alpha = 0.05$, to determine whether $\mu$ is less than a threshold value $\mu_c$. If the average duplicate query ratio, $\mu$, crosses the threshold, $\mu_c$, the cycle detection and cycle removal algorithms are initiated to detect and remove the three or four length cycles, respectively.

Algorithm 1. Algorithm INCONSISTENCY ESTIMATION. This algorithm estimates the average duplicate ratio of the network and checks whether it exceeds a threshold. Depending upon the results, it initiates the CYCLE DETECTION algorithm.

    d_P = duplicate query ratio at peer P
    d_t = threshold duplicate query ratio
    μ_c = threshold average query ratio
    if d_P > d_t then
        Send query to all neighbors requesting their duplicate query ratio d_i
        Estimate the mean x̄ and variance S² of the duplicate query ratio of the network from the values of d_i
        Test the hypotheses H0: μ = μ_c; H1: μ < μ_c for a given level of significance α
        if H1 == FALSE then
            Initiate CYCLE_DETECTION algorithm
        else
            Do Nothing
        end if
    end if

Algorithm 2. Algorithm CYCLE DETECTION. This algorithm detects the short length cycles through any peer P. Although this algorithm is initiated by the peer P, it is a distributed algorithm and is run by each peer through which the CYCLE-DETECT control message passes.

    SeqNo := Message Sequence Number
    ID(i) := ID of peer i
    T := Message Time Stamp
    N(i) := Neighbor set of peer i
    Send a CYCLE DETECT control message with TTL(2) to neighbors N(P).
    if A peer Q receives a CYCLE DETECT message created by some other peer then
        Stamp ID(Q) in the message
        Stamp N(Q) in the message
        Save SeqNo, T, and ID of adjacent peer (:= arrID) from whom the message has arrived
        if the CYCLE DETECT message is NOT a DUPLICATE message then
            If TTL == 1 then set TTL := 0 and forward to neighbors N(Q), else do nothing
        else
            if TTL == 0 then
                Set TTL := 2 and send the duplicate message through arrID from where the message arrived
            else
                Set TTL := 1 and forward the duplicate message through arrID from where the message arrived and also to other neighbors N(Q)
            end if
        end if
    end if
    if Peer P receives duplicate CYCLE DETECT messages sent by itself then
        Initiate CYCLE REMOVE algorithm
    end if

6.2. Cycle detection

The major objective of this algorithm is to detect the edges of the cycles that have arisen due to inconsistency. If $\mu \ge \mu_c$ for significance level $\alpha$, then the coordinator attempts to detect those nodes that are involved in a short length cycle with it by initiating this algorithm. For this, the coordinator sends a special TTL(2) control message, which we name the CYCLE-DETECT message, to its adjacent peers along with a time-stamp. On receiving this message, each adjacent peer records in its own table the message sequence number, the time-stamp, and the id of the peer from which the message has arrived. This record helps in identifying duplicate messages. Further, each adjacent peer attaches to the message its id and the number of its first neighbors, and forwards the message to its adjacent neighbors (other than the one from which it has received the message), reducing the TTL by 1. If a peer receives a duplicate CYCLE-DETECT message, it implies the existence of a short length cycle. The information about the nodes in the cycle has to be sent to the coordinator. Hence, the peer receiving the duplicate control message attaches its id and the number of its first neighbors to each of the received messages, and sends back each message through the same path from which the message had arrived. Since each peer has recorded the sending peer, the peers forward the duplicate control message along the same path in the reverse direction towards the coordinator peer. The messages follow the reverse path and reach the coordinator, which thereby obtains the topology information. It then runs the short length cycle removal algorithm to identify those links that should be dropped to remove all short length cycles.

6.3. Cycle removal

The cycle removal algorithm is simple; the coordinator decides to drop the link that will affect the minimum number of nodes. When a coordinator receives back the duplicate CYCLE-DETECT control messages sent by itself, it extracts the information about the number of first hop neighbors of each node from the control messages. To decide which link to drop, for every link it calculates the number of peers that will use the link by taking the sum of the first hop neighbor counts of the two nodes of the link, and then drops the link with the minimum sum. However, if the link connects to a leaf-peer, then it does not drop that link and considers the next best link that connects two ultra-peers. Links to leaf-peers are not disconnected, as leaf-peers are connected to only three or four ultra-peers and reducing their degree by one would have a big impact on their performance. The coordinator peer sends a message to all the peers involved in the cycle informing them of the link to be dropped. All the peers update this information and the peers corresponding to the link propagate this information to all their neighbors. The pseudocode is presented in Algorithm 3. The performance of the algorithm depends upon the proper choice of the threshold average duplicate query ratio $\mu_c$. If the threshold is kept too high, the number of redundant messages in the network might increase enormously, whereas a lower value might lead to unnecessary deletion of links, leading to a degradation in performance. Thus we need to have a tradeoff. In Section 7.6, we experimentally derive an idea of $\mu_c$ for certain request patterns. It can be seen that the asymptotic bound on the number of messages transmitted for handling an inconsistency is $O(d_u^2)$, where $d_u$ is the sum of the maximum number of ultra and leaf-neighbors of an ultra-peer.

Algorithm 3. Algorithm CYCLE REMOVE. This algorithm is initiated by a coordinator peer P to remove a suitable cyclic link of an already detected short length cycle in order to break the cycle.

    Extract the cyclic path p from the message through which the peer is traversed
    Find the sum of the neighbor count of the endpoint peers of every link of cycle p
    Select the link l_ij with minimum sum that connects peers i and j
    if l_ij connects to a leaf-peer then
        repeat
            Select the link l_ij in path p with next minimum sum
        until i and j are both ultra-peers
    end if
    Send message to peers i and j to drop the link

7. Evaluation by simulation

To validate our approach, we have performed numerous experiments. We have taken networks of different sizes (up to 100k nodes) and performed experiments on those networks several times to obtain the average behavior. Through these experiments, we have seen that HPC5 performs better than the existing protocols.

7.1. Search performance

The efficiency of search algorithms can be measured using various metrics such as success rate, average number of hops required to get results (response time), message complexity, network coverage (number of nodes discovered), percentage of message duplication, etc. [12]. In our simulations, we use message complexity and network coverage as performance metrics. Message complexity is defined as the average number of messages required to discover a peer in the overlay network. Network coverage is the number of unique peers explored during query propagation in limited flooding. We plot the network coverage and message complexity (y-axis) with TTL(2) and TTL(3) flooding against the size of the network (x-axis). We also show the performance separately for queries originating from leaf-peers and ultra-peers. To get the overall performance of the network, we choose the number of ultra-peers and leaf-peers for query flooding in the same UL ratio. The performance of the network is greatly influenced by the value of TTL used in search and thus we discuss the performance metrics based on TTL(2) and TTL(3) separately. The search performance (especially message complexity) also depends on the implementation of the QRP technique [1]. Thus we discuss the search performance without and with QRP.

7.2. TTL(2) without QRP

It is clear from Fig. 10a and b that with TTL(2), cycle-5 networks are better than cycle-3 networks in both message complexity and network coverage. In cycle-5 networks, the network coverage is approximately doubled and message complexity is almost 20% less than that of cycle-3 networks. With TTL(2), a search query covers a significant portion (in our simulation, more than 30%) of the cycle-5 network with fewer redundant messages. From the results we see that the message complexity is not as close to 1 as expected. This is because the message complexity of the leaf-peer generated query is particularly high (Fig. 10b). In cycle-5 networks, a leaf-peer can be connected to two ultra-peers which are themselves 3rd or 4th neighbors of each other, and it thus becomes part of a cycle-5 or cycle-6 (Fig. 9). From Fig. 9 we see that a leaf-peer search is initiated by its ultra-peers; both ultra-peers 1 and 4 (in Fig. 9a) {1 and 5 (in Fig. 9b)} start a TTL(2) flooding. Consequently redundant messages are produced at ultra-peers 2 and 3. However, hardly any redundancy is generated in ultra-peer initiated queries (Fig. 10b). As a result, cycle-5 networks generate a large number of redundant messages, which is reflected in Fig. 10b.

Fig. 9. Effect of leaf-peer layer with TTL(2) search (panels (a) and (b)). The arrows inside and outside of polygons indicate the directions of search by a leaf-peer and an ultra-peer, respectively.

Fig. 10. Network coverage and message complexity with TTL 2 for cycle-3 and cycle-5 networks (ultra-peer, leaf-peer and overall curves; x-axis: number of peers).

7.3. TTL(3) without QRP

Fig. 11a shows that in our simulation, the entire cycle-5 network is covered with TTL(3) search, which is almost double the coverage attained in cycle-3 networks. If any pair of ultra-neighbors of a particular leaf-peer is not more than 6 hops apart from each other (in the case of TTL(2) it was 4 hops), then the query generates redundant messages. In cycle-5 networks the probability of forming cycle-5 and cycle-6 is very high. As a result, the message complexity of cycle-5 networks becomes higher than that of cycle-3 networks (Fig. 11b). As mentioned earlier, Gnutella (Limewire etc.) uses TTL(3) in dynamic querying only for rare searches [1,2]. Therefore larger network coverage in this case, which increases the query hit probability, is more essential than a slight increase in message complexity.

Fig. 11. Network coverage and message complexity with TTL 3 for cycle-3 and cycle-5 networks (ultra-peer, leaf-peer and overall curves; x-axis: number of peers).

7.4. TTL(2) and TTL(3) with QRP

With the QRP technique, searching is performed only at the ultra-peer layer, since ultra-peers contain the indices of their children [1,2]. So, the measurement of message complexity at the ultra-peer layer is more appropriate for comparing results with Gnutella networks. The ultra-peer layer message complexity is shown in Fig. 12. Simulation shows that the message complexity in TTL(2) of cycle-3 networks is almost 2–2.5 times that of cycle-5 networks. Even in TTL(3) search, cycle-3 networks generate 25% more messages than cycle-5 networks. So the HPC5 protocol will be more effective in the Gnutella network in the presence of the QRP protocol.

Fig. 12. Message complexity with TTL(2) and TTL(3) at the ultra-peer layer (cycle-3 and cycle-5 networks; x-axis: number of peers).

7.5. Comparison of robustness

In this section we compare the robustness of the evolved networks with that of the base network. In order to test the robustness of the network evolved through HPC5, we have considered networks with 50k peers and plotted the percentage of existing peers belonging to the largest component against the percentage of peers removed. We have removed peers in two ways: (i) random removal and (ii) pathologically removing the highest-degree nodes first [22]. The upper (right) two curves in Fig. 13 represent the effect of random removal, whereas the lower (left) two curves represent pathological removal. Both cycle-3 and cycle-5 networks are extremely resilient to random removal and the largest connected component still contains 80% of existing peers even after removing almost 75–80% of nodes. The network gets fragmented after 85% peer removal. But in pathological removal, both cycle-3 and cycle-5 networks get fragmented after removing only 6–7% of total nodes. Through our empirical study, we can conclude that robustness is similar for both cycle-3 and cycle-5 networks.

Fig. 13. Comparison of robustness between cycle-3 and cycle-5 networks (random and pathological removal; x-axis: percentage of nodes removed, y-axis: size of the largest component (%)).

7.6. Inconsistency handling

We simulated the average duplicate ratio at the peers in the whole network for certain percentages of inconsistent peers in the network. A peer is termed inconsistent if it is involved in a short length cycle. To simulate the effect of inconsistent peers in the network, we re-organized the [...] like, the number of ultra and leaf-neighbors, or the total number of neighbors of any peer. We denote the percentage of inconsistent peers in the network as the percentage of peers whose neighbor lists have been modified explicitly by us so as to form short length cycles. We first state the simulation results showing the impact of inconsistent peers on the average duplicate ratio of the whole network and then discuss the effectiveness of our proposed inconsistency handling algorithm.

7.7. Impact of inconsistency on average duplicate ratio

We simulated the average duplicate ratio of the network with respect to the percentage inconsistency in the network for a total node count of 125k and 150k. Every peer in the network was considered equally active, i.e. each peer had the same average query sending rate. Fig. 14 shows the results with and without inconsistency handling.
As can be seen, when no inconsistency handling cycle-5 topology of the simulated Gnutella network to is applied, the average duplicate ratio of the network in- form short length cycles without changing the parameters creases rapidly with increasing percentage inconsistency. 1454 J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 Average Duplicate Ratio for Inconsistent Gnutella Networks the peer inconsistency has crossed 2%. The percentage 0.008 No Inconsistency Handling 125K inconsistency of the network has been varied from 1% to No Inconsistency Handling 150K With Inconsistency Handling 125K 15% for a total network size of 125k peers. The results 0.007 With Inconsistency Handling 150K shown in Fig. 15 reveal that the percentage correctness Average Duplicate Ratio of Network 0.006 of the estimations are nearly 100%, except for some incon- sistencies very near to 2%. This is because, for low percent- 0.005 age inconsistency, the actual average duplicate ratio of the 0.004 network is very close to 0.002; hence the number of incor- rect estimations are higher compared to the case when the 0.003 actual average duplicate ratio is much higher than the threshold. 0.002 The efficiency of the cycle detection and removal algo- 0.001 rithms is shown in Fig. 14. As the figure shows, the cycle detection and removal algorithm successfully reduces the 0 0 2 4 6 8 10 12 14 average duplicate ratio of the whole network to the thresh- Inconsistent Peers (%) old bound of 0.002 for all percentages of peer inconsis- tency. However, the cycle removal algorithm deletes Fig. 14. The graph plots the change in average duplicate ratio of the peers l in the network with increasing percentage of minimum inconsistent peers. The number of peers used in the simulation are 125k and 150k. The graph shows the results when no inconsistency handling is done and also for the case when inconsistency handling is applied beyond a threshold of No. 
of peers = 125K, Critical Duplicate Ratio = .002 100 average duplicate ratio lc ¼ 0:002. % Correctness of Estimated Results 80 From the Pearson product-moment correlational coefficient of the simulated results (refer Appendix A), we find that 60 the average duplicate ratio and the percentage inconsis- tency of the network exhibit a very high linear depen- 40 dency. Thus the average duplicate ratio of the network can be used to predict the percentage inconsistency in 20 the network, i.e. we can claim that when the average dupli- cate ratio of the peers in the network, l, exceeds a thresh- old lc , it implies that the percentage inconsistency in the 0 0 2 4 6 8 10 12 14 16 network, i, has also exceeded a threshold value of ic and Inconsistent Peers (%) vice versa. Hence, our proposed inconsistency handling algorithm estimates the average duplicate ratio of the Fig. 15. Percentage correctness of the test, whether the percentage inconsistency i has crossed a threshold value ic ¼ 2%, when the average network to obtain an idea about the percentage inconsis- duplicate ratio of the network has crossed a threshold value lc ¼ 0:002 tency in the network. We, now state the simulation results for 0.05 level of significance and vice versa. The percentage inconsistency showing the efficiency of our proposed inconsistency han- was varied from 1% to 15%. The total number of peers used was 125k. dling algorithm in predicting and reducing the average duplicate ratio of the whole network. 7.8. Efficiency of the proposed inconsistency handling Percent Change in Average Degree of Peers algorithm 5 % Change in Average Degree of Peers We measure the efficiency of the proposed inconsis- 4 tency algorithm by measuring the percentage correctness of inconsistency estimation by the peers and by the reduc- 3 tion in the average duplicate ratio of the network for differ- ent percentage of peer inconsistency. 
The inconsistency handling algorithm was divided into three parts: inconsis- 2 tency estimation, cycle detection, and cycle removal. We simulated our inconsistency estimation algorithm and ob- 1 tained the results for percentage correctness of the estima- tion – whether the percentage inconsistency i has crossed 0 a threshold value ic ¼ 2%, when the average duplicate ra- 0 2 4 6 8 10 12 14 16 tio of the network has crossed a threshold value Inconsistent Peers (%) lc ¼ 0:002, for 0.05 level of significance, and vice versa. Fig. 16. The graph shows the change in the average degree of the peers In our simulations, we observed that the average duplicate when the inconsistency handling protocol is applied on the peers when ratio is 0.002, when the peer inconsistency is nearly 2%. the average duplicate ratio l of the network crosses the threshold Thus, the threshold value lc ¼ 0:002 predicts whether lc ¼ 0:002. The number of peers used in the simulation was 125k. J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 1455 certain links in the network to remove inconsistencies. short length cycle formation by selecting suitable neigh- Hence we need to study the effect of link removal on the bors, rather than removing edges, the network coverage average degree of the peers. Fig. 16 shows the percentage is also higher as compared to DCMP. However, DCMP change in the average degree of the peers following cycle works for any TTL values and it has been shown to perform removal algorithm. The maximum reduction of the average well for TTL values upto 8, whereas, HPC5 supports only degree is around 4%. Since from Table 2, we find that the TTLð2Þ flooding. We implemented the DCMP protocol on average degree of the ultra-peers is duu þ dul  48, hence our simulator; the comparison of the message overhead the number of edges that might be removed per peer is ex- and the network coverage with HPC5 is detailed next. pected to be around 1–2, at most. 7.10. 
Comparison of message overhead We next compare the performance of the HPC5 protocol with the Distributed Cycle Minimization Protocol (DCMP) for We measured the message overhead as the total num- p2p networks. ber of control messages sent over the network for a certain period of time. We compare the message overhead for both 7.9. Comparison with DCMP HPC5 and DCMP with respect to the size of the network (number of nodes). The list of various simulation parame- In this section, we compare the performance of DCMP ters that we have used for the simulations are as follows: with HPC5 in terms of message overhead and the network coverage of the peers. The DCMP protocol reduces traffic 1. We considered that every peer sends an average of 3.6 redundancy by detecting the short length cycles dynami- query requests per hour distributed uniformly (as con- cally and deleting suitable edges to remove these cycles. sidered in the DCMP paper [25]), and run the simulation In Algorithm 4 (Appendix C), we highlight the major steps for two hours of simulation time and various network of the DCMP algorithm for ready reference. The perfor- sizes. mance comparison is based on the simulation results that 2. For both HPC5 and DCMP, we used a growing network we obtain from implementing the DCMP protocol in our where peers arrive in a burst of 25k after every simulator. In the DCMP paper [25], results have been given 10 min. The growth process in each case is similar and for at most 10k nodes, without considering certain specific is discussed in Section 2. details of the Gnutella network (like average degrees of the 3. The minimum Gnutella message size is 23 bytes [16]; ultra and leaf-peers, QRP, etc.). Hence to make the compar- we consider each Node Information Vector (NIV) field ison with HPC5, we deemed fit to implement the DCMP of a peer A of length equal to six times the number of protocol in the simulation platform. neighbors of A. 
This is because, for each neighbor of A, The DCMP protocol follows a lazy approach in detecting 4 bytes is allocated to represent the address of A and and removing cycles in Gnutella networks; however, as the 2 bytes for representing the bandwidth of A, making a gatepeers and transitive peers constantly broadcast tagged total of 6 bytes. messages to their TTL-hop neighbors, a fixed number of 4. In our experiments, we ignore the overhead in DCMP control messages will always flow through the network. due to dissemination of gatepeer information in the Another problem with the DCMP protocol is that the dele- form of tagged messages. tion of edges to remove cycles might reduce the network 5. For the case of HPC5, we assume that a peer sends a coverage of the peers in the network. In contrast, HPC5 control message of 23 bytes for a connection request generates control messages only at the bootstrapping to a peer. The peer responds with the list of addresses phase, when a node joins a network. Since HPC5 avoids Total Control Messages Transferred (in Megabytes) Overhead Comparison of DCMP and HPC5 Comparison of Average Network Coverage : DCMP vs HPC5 (With QRP) 6 DCMP HPC5: Without Link Removal Average Network Coverage Per Peer 450 DCMP 800 3600 HPC5 HPC5: With Link Removal HPC5 DCMP TTL d =2 400 700 1800 4 600 350 25 50 75 100 500 300 25 50 75 100 400 250 300 200 200 150 100 30 40 50 60 70 80 90 100 30 40 50 60 70 80 90 100 Total Number of Peers in Network (in Thousands) No. of Peers (In Thousands) Fig. 17. (a) Shows the total number of control messages transferred (in megabytes) in the whole network. The figure in inset shows the average number of control messages sent per peer (in kilobytes). (b) Compares the average network coverage of both leaf and ultra-peers in DCMP and HPC5. The figure in inset shows the coverage of the leaf-peers, and the main figure shows the average networks coverage of the ultra-peers in a Gnutella network. 
The number of peers in the simulations is varied from 30k to 100k. 1456 J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 of its neighbors; each address along with the port num- messages can be reduced, if we suitably eliminate short ber is assumed to be of 6 bytes. An additional overhead length cycles from the network topology. We have shown of 23 bytes, which is the minimum Gnutella message that the protocol is far more efficient than existing proto- size is also assumed for the latter case. cols. We also studied the various hurdles in implementing our scheme and proposed methods to handle inconsistency As can be seen from Fig. 17a, for a large network size in the network. The basic strength of our protocol lies in its ðN ¼ 100kÞ, in HPC5 the total number of control messages simplicity and the ease with which it can be implemented sent through the network (measured in bytes) is less by al- on existing Gnutella networks. The modularity of the most 30% compared to DCMP. This is because, in HPC5, the handshaking and the inconsistency handling protocol pro- overhead is mainly due to the response messages sent by vides a suitable choice to the designers whereby they can the peers in reply to a connection request. With increasing decide on the use of any of these protocols without affect- number of nodes, the probability of finding a valid peer at a ing the other one. Moreover, the inconsistency handling particular attempt increases; thus peers require to send protocol can be tuned dynamically. Our protocol can be fewer connection requests, thereby reducing the number instrumental in improving the scalability of Gnutella net- of control messages. Hence, with increasing network size, works; moreover, this concept of topology management the average number of control messages sent per peer de- can be customized for other unstructured p2p networks creases much faster compared to DCMP (refer inset in to improve blind search behavior, resulting it in perform- Fig. 
17a). ing comparably with many proposed intelligent search mechanisms that are generally more computation intensive. 7.11. Comparison of network coverage We also measured the average TTLð2Þ network coverage Appendix A. Relation between average duplicate ratio of both ultra and leaf-peers, for both DCMP and HPC5, and inconsistent peers when QRP is followed. We assume that DCMP preserves all cycles of length greater than 4 and eliminates all 3 The following table indicates the result of the simula- and 4-length cycles, i.e. TTLd ¼ 2 [25]. We compare the re- tions. If X and Y denote two parameters of an experiment sults of DCMP with two possible cases of HPC5. and we take n measurements of the parameters denoted by (xi ; yi ), where i ¼ 1; 2; . . . n, then the Pearson Correlation 1. In the first case, we assume that HPC5 does not remove Coefficent between X and Y is given by (see Table 4) any link, i.e. HPC5 does not initiate the inconsistency n xi yi  xi yi P P P handling algorithm. r xy ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi; 2. In the second case, to compare DCMP with the worst q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi n x2i  ð xi Þ2 n y2i  ð yi Þ2 P P P P case performance scenario of HPC5, we remove 4% of the total links (as seen in Fig. 16), but from the ultra- peer level of the network. This simulates the possible where all the summations are carried for i ¼ 1 to n. For scenario that a large number of links have been the given data, let X denote the percentage of inconsistent removed by the inconsistency handling algorithm of peers and Y denote the average duplicate ratio at the HPC5, due to inconsistency. Removing 4% of the total peers; we calculated the value of r xy for 120k peers and links reduced the average degree of the ultra-peers by as much as 4%. 
Table 4 The table shows the simulation results of the average duplicate ratio at the As observed in Fig. 17b, in both cases, DCMP has much peers for a given percentage of inconsistent peers. The total number of lower network coverage as compared to HPC5. This is be- peers used in the simulation are 125k and 150k. cause, Gnutella networks form a large number of 3 and 120k 150k 4-length cycles; hence in DCMP, a high number of dupli- cate messages are generated at each peer. Thus DCMP % Average % Average Inconsistent duplicate ratio Inconsistent duplicate ratio eliminates a large number of links (we observed that for peers peers TTLd ¼ 2, DCMP eliminates about 16% of the total links), 1 0.001889 1 0.002005 thereby reducing the average network coverage of the 2 0.002019 2 0.002171 peers. 3 0.002336 3 0.002388 4 0.002597 4 0.002654 5 0.003103 5 0.003124 8. Conclusion and future work 6 0.003366 6 0.003376 7 0.003786 7 0.003717 In this paper, we have presented a handshake protocol 8 0.004328 8 0.004315 9 0.004812 9 0.004661 which is compatible with Gnutella-like unstructured 10 0.005406 10 0.005374 two-tier overlay topology. The objective of the protocol is 11 0.005820 11 0.005725 to enhance the network coverage of the peers, besides 12 0.006051 12 0.005991 reducing redundant messages, in the existing Gnutella net- 13 0.006066 13 0.006020 14 0.008082 14 0.005996 works. Our protocol is based on the observation that net- 15 0.006214 15 0.006126 work coverage of peers can be improved, and redundant J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 1457 150k which is around 0.984 for both cases. Since the val- the first neighbors leads to further SðQ Þ unique second neigh- ues of r xy are near to 1 for both cases, so it shows that bors of Q, then for a given value of FðQ Þ þ LðQ Þ, the network there is a very high linear dependency between the two coverage of Q, given by CðQ Þ ¼ FðQ Þ þ SðQ Þ, is maximum, if parameters. 
Thus the peers can use the duplicate ratio and only if Q is not involved in any cycle of length less than 5. of itself and the neighboring peers to estimate the aver- age duplicate ratio of the whole network based on which Proof 2. If-part: If Q is not involved in any cycle of length less it can decide whether to initiate the inconsistency han- than 5, then CðQ Þ is maximum. The if-part of the proof is triv- dling protocol. ial and is directly obtained from Lemma 1. Since according to Lemma 1, if the minimal length cycle from Q through any Appendix B. Importance of cycle-5 networks for node v 2 V has length 5, and Dðv Þ 6 2, a TTLð2Þ flood mes- gnutella sage from Q must reach v exactly once. This implies that all the nodes within two hops from Q are reached exactly once Gnutella networks use a TTLð2Þ query for most searches; from Q, i.e. all the LðQ Þ edges from first hop neighbors of Q however, for rare searches it uses a TTL value of 3 (dynamic must reach to different and unique nodes. Hence, the num- querying [5]). A TTLð2Þ query from a peer reaches its first as ber of unique second neighbors SðQ Þ reached is given by well as its second neighbors. Thus, the network coverage of SðQ Þ ¼ LðQ Þ. Since SðQ Þ cannot be greater than LðQ Þ, hence, a peer may be defined as the sum of the set of its first and for a given FðQ Þ þ LðQ Þ, the coverage of node Q will be max- second neighbors. Suppose a peer, P, in a p2p network has imum and equal to CðQ Þ ¼ FðQ Þ þ LðQ Þ. Thus it is proved FðPÞ number of first neighbors, and let LðPÞ denote the that if Q is not involved in any cycle of length less than 5, number of outgoing edges from FðPÞ first neighbors of P. then CðQ Þ is maximum. Moreover, suppose LðPÞ edges connect to SðPÞ unique peers Only-if part: If CðQ Þ is maximum, then Q is not involved in other than P or any of its first neighbors. Then the network any cycle of length less than 5. We prove the above coverage of P is given by the sum FðPÞ þ SðPÞ. 
We prove statement by method of contradiction. Suppose CðQ Þ is that, for given FðPÞ and LðPÞ, the peer P, that uses TTLð2Þ maximum for a given FðQ Þ þ LðQ Þ, when Q is involved in search queries, will have a maximum network coverage, some cycles of length less than 5. We note that, any pair of if the peer P does not involve itself in any cycle of length peers will be connected by a single edge only (no parallel less than five. Moreover, with the above stated condition, edges occur), so there is no possibility of any cycle of the peer P does not produce any redundant message at length 2. Thus Q can be involved in cycle of either length 3 any of its first or second neighbors. We prove each of the or 4. Let us initially assume that Q is involved in a cycle of statements using a lemma and a theorem as stated below. length 3. If Q is involved in cycle of length 3, then any two In each case, we represent the network as a graph GhV; Ei, of its first neighbors (say N1 and N 2 ) must be connected by where the vertex set V represents the peers in the network an edge E12 among themselves. We remove the end of edge and the edge set E represents the overlay links between E12 connected to N 1 , and connect it to a new node, N a , that them. is not presently a first or second neighbor of Q. Thus, for same value of FðQ Þ þ LðQ Þ, the number of second neighbors Lemma 1. Given a network, represented as, GhV; Ei with a of Q increases by 1; hence we have the new coverage source vertex Q. A TTLð2Þ flood query from the source C 0 ðQ Þ > CðQ Þ, which is not possible as we have already vertex Q will reach a vertex v 2 V exactly once, if the assumed that CðQ Þ is maximum. Thus, our assumption was minimum length cycle through Q and v has length 5, and wrong and Q cannot be involved in a cycle of length 3. the minimum distance of v from Q, given by DQ ðv Þ, is less than or equal to 2. Now suppose we assume that Q is involved in a cycle of length 4. 
Then any two of its first neighbors (say again N 1 and Proof 1. Since the minimum distance of node v from Q, N 2 ) must be connected to a common peer, say N 3 . We remove given as DQ ðv Þ 6 2, so node v must be reachable from Q the link connecting N 2 and N 3 , and add a new link from N 2 to using a TTLð2Þ flood query. We now prove that if there a new peer N b that is not currently a first or second neighbor exists a cycle between Q and v of minimum length 5 and of Q. Thus, as in the previous case, for a given value of DQ ðv Þ 6 2, then there will be no duplicate query at vertex FðQ Þ þ LðQ Þ, the number of second neighbor of Q increases by v from source Q. We prove this by method of contradiction. 1; hence the new coverage C 0 ðQ Þ > CðQ Þ, which is not Let us assume that node v receives a duplicate query from possible as we have already assumed CðQ Þ as maximum. source Q through two separate paths, say p1 and p2 . Then, Hence Q cannot be involved in any cycle of length 4. the path length between Q and v through each of these Thus for CðQ Þ to be maximum, Q cannot be involved in paths, p1 or p2 , is at most 2. Thus the cycle represented by any cycle of length less than 5. h Q ,p1 v ,p2 Q is at most of length 4 which is a contradiction to our initial assumption that Q and v are connected The generalization of the proof can be thought as, for a through a cycle of minimum length 5. Thus no redundant given number of edges in a p2p network, the network cov- messages can be produced at v. h erage of the peers can be enhanced, if we can increasingly eliminate short length cycles from the network. Theorem 1. Given a network, represented as GhV; Ei, with a source vertex Q that uses TTLð2Þ based flooding mechanism Appendix C. Steps of algorithm of the DCMP protocol for searching. Let FðQ Þ be the number of first neighbors of Q and LðQ Þ denote the total number of outgoing edges from the We present an algorithmic outline (Algorithm 4) of the FðQ Þ first neighbors. 
If LðQ Þ number of outgoing edges from DCMP protocol for ready reference. 1458 J. Chandra et al. / Computer Networks 54 (2010) 1440–1459 References Algorithm 4. Algorithm DCMP. Each message in DCMP is assigned a globally unique ID (GUID), [1] Gnutella and Limewire: <URL www.limewire.org>. every peer A on receiving a message from B [2] Gnutella protocol specification 0.6: <http://rfc-gnutella.sourceforge. net>. records the GUID and its direction of travel [3] Gnutella: <URL www.gnutellaforums.com>. ðB ! AÞ in a history table. The algorithm is [4] Gwebcache system: <URL www.gnucleus.com>. initiated by a peer A that receives a duplicate [5] How gnutella works, <URL wiki.limewire.org>. [6] G. Chen, C.P. Low, Z. Yang, Enhancing search performance in message (two messages with same GUID) from unstructured P2P networks based on users’ common interest, IEEE two different directions, say B and C Transactions on Parallel and Distributed Systems 19 (6) (2008) 821– 836. GUID :¼ Globally Unique ID of the duplicate [7] M. Jelasity, S. Voulgaris, R. Guerraoui, A.-M. Kermarrec, M. van Steen, message Gossip-based peer sampling, ACM Transactions on Computer DetectionID :¼ The direction of the duplicate Systems 25 (3) (2007) 8. [8] P. Karbhari, M.H. Ammar, A. Dhamdhere, H. Raj, G.F. Riley, E.W. message, (say B ! A) Zegura, Bootstrapping in Gnutella: a measurement study, in: PAM, NIVðAÞ :¼ Node Information Vector of peer A Lecture Notes in Computer Science, vol. 3015, Springer, 2004, pp. containing the IP address, bandwidth, cpu speed, 22–32. [9] Y. Liu, X. Liu, L. Xiao, L.M. Ni, X. Zhang, Location-aware topology degree and the neighbors of A matching in P2P systems, in: INFOCOM: The Conference on On receiving duplicate messages from a source Q Computer Communications, Joint Conference of the IEEE Computer through 2 different adjacent nodes, say B and C, and Communications Societies, 2004. [10] Y. Liu, L. Xiao, X. Liu, L.M. Ni, X. 
Zhang, Location awareness in peer A introduces a new message called, unstructured peer-to-peer systems, IEEE-TPDS: IEEE Transactions on Information Collecting (IC) message with same Parallel and Distributed Systems 16 (2005) 163–174. GUID containing DetectionID; NIVðAÞ and [11] K. Lua, J. Crowcroft, M. Pias, R. Sharma, S. Lim, A survey and forwards it to both B and C. comparison of peer-to-peer overlay network schemes, Communications Surveys & Tutorials, IEEE (2005) 72–93. For each node i on receiving the IC message [12] Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in repeat unstructured peer-to-peer networks, in: Proceedings of the 2002 Record the GUID and DetectionID International Conference on Supercomputing (16th ICS’02), ACM, 2002, pp. 84–95. Add its own NIV, i.e. NIVðiÞ [13] S. Merugu, S. Srinivasan, E.W. Zegura, Adding structure to Propagate the IC message in reverse direction unstructured peer-to-peer networks: the role of overlay topology, from which the original message arrived. in: NGC 2003, and ICQT 2003, Proceedings, Lecture Notes in Computer Science, vol. 2816, Springer, 2003, pp. 83–94. until A node i receives both IC messages [14] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon, if peer X receives both IC messages then Network motifs: simple building blocks of complex networks, peer X checks the NIVs’ of each peer in message Science 298 (5594) (2002) 824–827. [15] C. Papadakis, P. Fragopoulou, E. Athanasopoulos, M.D. Dikaiakos, A. and finds the most powerful peer (Gatepeer) Labrinidis, E. Markatos, A feedback-based approach to reduce depending upon the degree, bandwidth and CPU duplicate messages in unstructured peer-to-peer networks, in: speed, Integrated Research in GRID Computing, 2007. [16] M. Portmann, A. 
    determines the cut position by finding the link whose end nodes are maximally equidistant from the gatepeer,
    sends a Cut Message (CM) to the peers in the cycle containing the information about the link to cut and the IP address of the gatepeer.
end if
if the CM arrives at the gatepeer then
    the gatepeer generates a tagged message containing the NIV of the gatepeer and the number of hops from the message origin to the gatepeer,
    piggybacks the tagged message with other messages that are forwarded by the gatepeer,
    periodically sends the tagged message; initially it is sent frequently, and the rate gradually slows down with time.
end if
if a peer Y receives tagged messages from more than one direction then
    peer Y becomes a transitive peer,
    it also generates tagged messages and piggybacks them.
end if

Joydeep Chandra completed B.Tech. (Computer Science and Engineering) from Haldia Institute of Technology, India in 2000 and M.Tech. (Computer Science and Engineering) from Indian Institute of Technology, Kharagpur in 2002. Presently, he is pursuing his Ph.D. from Indian Institute of Technology, Kharagpur. His research interests include p2p networks, distributed algorithms and complex networks.

Santosh Kumar Shaw completed BE (Computer Science and Engineering) from Bengal Engineering College (D.U), India in 2001 and M.Tech (Computer Science and Engineering) from Indian Institute of Technology, Kharagpur, India in 2008. He served as a faculty member at University Institute of Technology, Burdwan University, India. He also served in Magma Design Automation as Associate Member of Technical Staff. He is presently working with Berkeley Design Automation as a Member of Technical Staff. His research interests include P2P networks and distributed algorithms.

Niloy Ganguly is an Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur. He received his B.Tech from IIT Kharagpur in 1992 and his PhD from BESU, Kolkata, India in 2004. He spent two years as Post-Doctoral Fellow in Technical University, Dresden, Germany before joining IIT, Kharagpur in 2005. His research interests are in peer-to-peer networks in particular and distributed dynamic networks in general. He has applied various bio-inspired techniques to solve problems in such networks. He also works on applying complex networks methodology in various information retrieval problems. He is presently leading a research group comprising several research scholars and master's students as well as various collaborators from industry and academia. Visit the Complex Networks Research Group (CNERG) at www.cse-web.iitkgp.ernet.in/~cnerg/.
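The cut-position rule in the cycle-handling steps (cut the link whose end nodes are farthest from, and equally distant from, the gatepeer) can be illustrated with a minimal sketch. It assumes the cycle is already known as an ordered list of peers starting at the gatepeer. The names `hops_from_gatepeer` and `choose_cut_link` are hypothetical, and the tie-break for even-length cycles (where no link has exactly equidistant end nodes) is an assumption of this sketch, not something the text specifies.

```python
from typing import List, Tuple


def hops_from_gatepeer(index: int, cycle_len: int) -> int:
    """Hop distance along the cycle from the gatepeer (index 0) to a peer."""
    return min(index, cycle_len - index)


def choose_cut_link(cycle: List[str]) -> Tuple[str, str]:
    """Return the link of `cycle` to cut.

    `cycle` lists the peers in ring order, starting at the gatepeer;
    the last peer is linked back to the first.
    """
    n = len(cycle)
    if n < 3:
        raise ValueError("a cycle needs at least three peers")
    best_link = (cycle[0], cycle[1])
    best_key = None
    for i in range(n):
        j = (i + 1) % n
        d_i = hops_from_gatepeer(i, n)
        d_j = hops_from_gatepeer(j, n)
        # Prefer links far from the gatepeer, then links whose end nodes
        # are equally far. Assumed tie-break: keep the first such link.
        key = (min(d_i, d_j), -abs(d_i - d_j))
        if best_key is None or key > best_key:
            best_key = key
            best_link = (cycle[i], cycle[j])
    return best_link
```

For a cycle of five peers `["G", "A", "B", "C", "D"]` with gatepeer `G`, the sketch selects the link `("B", "C")`, whose end nodes are both two hops from `G`. For the four-peer cycle `["G", "A", "B", "C"]` it falls back to `("A", "B")` under the assumed tie-break.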

References (25)

  1. Gnutella and Limewire: <www.limewire.org>.
  2. Gnutella protocol specification 0.6: <http://rfc-gnutella.sourceforge.net>.
  3. Gnutella: <www.gnutellaforums.com>.
  4. Gwebcache system: <www.gnucleus.com>.
  5. How gnutella works: <wiki.limewire.org>.
  6. G. Chen, C.P. Low, Z. Yang, Enhancing search performance in unstructured P2P networks based on users' common interest, IEEE Transactions on Parallel and Distributed Systems 19 (6) (2008) 821-836.
  7. M. Jelasity, S. Voulgaris, R. Guerraoui, A.-M. Kermarrec, M. van Steen, Gossip-based peer sampling, ACM Transactions on Computer Systems 25 (3) (2007) 8.
  8. P. Karbhari, M.H. Ammar, A. Dhamdhere, H. Raj, G.F. Riley, E.W. Zegura, Bootstrapping in Gnutella: a measurement study, in: PAM, Lecture Notes in Computer Science, vol. 3015, Springer, 2004, pp. 22-32.
  9. Y. Liu, X. Liu, L. Xiao, L.M. Ni, X. Zhang, Location-aware topology matching in P2P systems, in: INFOCOM: The Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2004.
  10. Y. Liu, L. Xiao, X. Liu, L.M. Ni, X. Zhang, Location awareness in unstructured peer-to-peer systems, IEEE-TPDS: IEEE Transactions on Parallel and Distributed Systems 16 (2005) 163-174.
  11. K. Lua, J. Crowcroft, M. Pias, R. Sharma, S. Lim, A survey and comparison of peer-to-peer overlay network schemes, Communications Surveys & Tutorials, IEEE (2005) 72-93.
  12. Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured peer-to-peer networks, in: Proceedings of the 2002 International Conference on Supercomputing (16th ICS'02), ACM, 2002, pp. 84-95.
  13. S. Merugu, S. Srinivasan, E.W. Zegura, Adding structure to unstructured peer-to-peer networks: the role of overlay topology, in: NGC 2003, and ICQT 2003, Proceedings, Lecture Notes in Computer Science, vol. 2816, Springer, 2003, pp. 83-94.
  14. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon, Network motifs: simple building blocks of complex networks, Science 298 (5594) (2002) 824-827.
  15. C. Papadakis, P. Fragopoulou, E. Athanasopoulos, M.D. Dikaiakos, A. Labrinidis, E. Markatos, A feedback-based approach to reduce duplicate messages in unstructured peer-to-peer networks, in: Integrated Research in GRID Computing, 2007.
  16. M. Portmann, A. Seneviratne, Cost-effective broadcast for fully decentralized peer-to-peer networks, Computer Communications 26 (11) (2003) 1159-1167.
  17. S. Saroiu, P.K. Gummadi, S.D. Gribble, A measurement study of peer-to-peer file sharing systems, Tech. rep., July 23 2002.
  18. S. Shaw, J. Chandra, N. Ganguly, HPC5: an efficient topology generation mechanism for gnutella networks, in: 10th International Conference on Distributed Computing and Networking - ICDCN, Hyderabad, 2009.
  19. D. Stutzbach, R. Rejaie, Capturing accurate snapshots of the gnutella network, IEEE INFOCOM, 2005, pp. 2825-2830.
  20. D. Stutzbach, R. Rejaie, Understanding churn in peer-to-peer networks, in: IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet Measurement, ACM, New York, NY, USA, 2006, pp. 189-202.
  21. D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, W. Willinger, On unbiased sampling for unstructured peer-to-peer networks, IEEE/ACM Transactions on Networking 17 (2) (2009) 377-390.
  22. D. Stutzbach, R. Rejaie, S. Sen, Characterizing unstructured overlay topologies in modern P2P file-sharing systems, in: Internet Measurement Conference, USENIX Association, 2005, pp. 49-62.
  23. D. Gavidia, S. Voulgaris, M. van Steen, CYCLON: inexpensive membership management for unstructured P2P overlays, Journal of Networks and System Management 13 (2) (2005) 197-217.
  24. S. Zhao, D. Stutzbach, R. Rejaie, Characterizing files in the modern gnutella network: a measurement study, in: Proceedings of SPIE/ACM Multimedia Computing and Networking, vol. 6071, 2006.
  25. Z. Zhenzhou, K. Panos, B. Spiridon, DCMP: a distributed cycle minimization protocol for peer-to-peer networks, in: Parallel and Distributed Systems, IEEE Transactions, vol. 19, IEEE, 2008, pp. 363-377.