Principal component analysis of link prediction scores to propose a binary classification model for the betweenness of edges in complex networks

Meghanathan, Natarajan

doi:10.1007/s13278-025-01438-7

Original Article
Open access
Published: 10 March 2025

Volume 15, article number 8, (2025)

Social Network Analysis and Mining

Abstract

The problem of quantifying edge betweenness in complex networks is computationally intensive. We propose a novel idea of instead simply classifying the edges as those of high-betweenness or low-betweenness. We make use of the tendency of neighborhood-based link prediction techniques to highly score node pairs (including those connected through edges) that are already part of a dense sub graph. We compute link prediction scores (using four different neighborhood-based link prediction techniques) for all the node pairs in the network. Our hypothesis is the following: edges with high link prediction scores are likely to have end vertices that are part of the same dense sub graph, and hence are expected to be of low-betweenness. On the other hand, edges with relatively lower link prediction scores are likely to connect two different sub graphs (which might otherwise not even be connected) and hence could be expected to be of high-betweenness. We conduct principal component analysis (PCA) of a link prediction scores dataset for all the node pairs in a network and compute a weighted average score (using the variances of the PCs as weights) for each node pair (including the edges in the network). We propose to categorize edges with negative values for the PCs-based weighted average link prediction score to belong to the class of "high-betweenness" edges and vice-versa. We evaluate the accuracy of our predictions using correlation studies.

1 Introduction

The betweenness of an edge (Freeman 1977; Girvan and Newman 2002) captures the participation of the edge in shortest path communication between node pairs in a network. The problem of quantifying the betweenness of edges (in the form of a centrality metric called the Edge Betweenness Centrality: EBWC (Girvan and Newman 2002)) in a complex network is a computationally intensive one, as it involves computing the shortest path trees (Brandes 2001; Cormen et al. 2009) rooted at every node in the network. Algorithms (e.g., Girvan and Newman 2002) for computing the EBWC metric tend to be synchronous in nature (i.e., it is not possible to compute the EBWC of just a particular edge or a group of edges in the network; we could either determine the EBWC for all the edges in the network or for none) and require global knowledge (i.e., the betweenness of edges cannot be determined using just local knowledge, such as the neighborhood information of the nodes). The EBWC values are expected to be relatively lower for edges within a dense sub graph of nodes (e.g., cliques, k-cores, communities, etc.) compared to the EBWC values for edges that bridge two different densely connected sub graphs. Likewise, the EBWC values for edges that connect a peripheral node (which would otherwise be mostly disconnected from the rest of the network) (Meghanathan 2023a) to a core node (nodes typically of a larger degree, located at the center of the network) are expected to be larger than the EBWC values of edges that connect two core nodes within the same dense sub graph. Throughout the paper, the terms 'node' and 'vertex', 'edge' and 'link', 'network' and 'graph', 'measure' and 'metric', 'predicted' and 'categorized' are used interchangeably.

In this paper, we propose a novel idea of instead simply classifying the edges as high-betweenness edges and low-betweenness edges, with a scoring mechanism based on principal component analysis (PCA) and link prediction techniques forming the basis for this classification. Our edge betweenness (EBW) binary classification model and the underlying scoring mechanism stem from the following hypothesis: link prediction techniques tend to highly score edges of lower betweenness, and the link prediction scores tend to be relatively lower for edges of higher betweenness. Our hypothesis exploits the tendency of link prediction techniques to highly score node pairs that, if connected (in the form of an edge), would further increase the density of an already dense sub graph. Link prediction techniques tend to lowly score node pairs that are not in the same sub graph and are separated by shortest paths of lengths greater than 2. We observe the link prediction techniques to also highly score the end vertices of edges that are already part of a dense sub graph: i.e., edges whose end vertices are within the same dense sub graph are scored highly compared to edges connecting two different sub graphs that would otherwise be either not connected or connected through a multi-hop path of length 2 or more.

Our proposed methodology is briefly explained as follows: For a given network of n nodes and m edges, we first build an n(n−1)/2 × 4 dataset of records, where each record corresponds to a node pair (including the edges in the network) and '4' is the number of features, corresponding to the link prediction techniques used: Preferential Attachment Index (PAI; Barabási and Albert 1999), Adamic Adar Index (AAI; Adamic and Adar 2003), Jaccard Index (JAI; Jaccard 1912) and Salton Index (SAI; Salton 1983). The entries in a record for a node pair are the link prediction scores for the node pair determined per each of these four link prediction techniques. We next conduct principal component analysis (PCA (Jolliffe 2002)) of the n(n−1)/2 × 4 link prediction scores dataset to compute a comprehensive link prediction score for each record (node pair) that captures all the underlying score variations among the link prediction techniques within the record as well as across records. In this pursuit, we compute a weighted average score (referred to as the Average Link Prediction: $LP_{Avg}^{PC}$ score) for each node pair as the weighted average of the entries for the node pair in the four principal components (PCs), with the variances of the entries in the PCs used as the weights.

We observe the $LP_{Avg}^{PC}$ scores to be positive and larger for edges that are within a dense sub graph (i.e., both the end vertices of the edges are within the same sub graph), whereas the $LP_{Avg}^{PC}$ scores are negative/lower for edges that are not part of any sub graph, but could connect two different sub graphs (inclusive of the scenarios of just a node and a sub graph). True to our hypothesis, we observe a negative correlation (with respect to Spearman's rank-based correlation (Strang 2006) and a larger probability of observing discordant pairs of edges) between the $LP_{Avg}^{PC}$ scores of the edges and their EBWC values. We hence propose that edges with positive $LP_{Avg}^{PC}$ scores be categorized to the class of edges with low EBW values and edges with negative $LP_{Avg}^{PC}$ scores be categorized to the class of edges with high EBW values. We quantitatively validate such a binary classification of edges in a network through five different measures: (1) the fraction of edges in the high-EBW class relative to the total number of edges in the high-EBW and low-EBW classes; (2) the ratio of the average EBWC value of the high-EBW class of edges to the average EBWC value of the low-EBW class of edges; (3) the fraction of pairs of edges (one from each class) in which the high-EBW edge incurs an EBWC value no lower than that of the low-EBW edge, relative to the product of the number of edges in the two classes; (4) the Spearman's rank-based correlation of the edges with respect to their EBWC values and their $LP_{Avg}^{PC}$ scores and (5) the discordant probability: the ratio of the number of discordant pairs of edges to the total number of concordant and discordant pairs of edges with respect to their EBWC values and their $LP_{Avg}^{PC}$ scores. We validate our proposed mechanism over a suite of 40 different real-world networks, ranging from random networks to scale-free networks.

We envision several applications that could benefit from a binary EBW classification model as well as the underlying PCA and link prediction techniques-based scoring mechanism. We list a few of them here:

1. Core-periphery analysis: We opine that edges with positive, very large $LP_{Avg}^{PC}$ scores could be branded as core-core (Meghanathan 2023a) edges, connecting two nodes within the core of the network, and edges with negative, very low $LP_{Avg}^{PC}$ scores could be branded as bridge edges that connect the peripheral nodes to the core nodes or two non-overlapping blocks of core nodes (Meghanathan 2023a).
2. Community detection: Several community detection algorithms (like the Girvan-Newman algorithm (Girvan and Newman 2002)) just need a ranking of the edges based on their EBWC values and not the actual EBWC values. The hypothesis behind these community detection algorithms is that edges with larger EBWC are more likely to be bridge edges, connecting two different communities. The negative correlation between the $LP_{Avg}^{PC}$ and EBWC values could be used as the basis: instead of computing the EBWC values for the edges, one could rank the edges in the increasing order of their $LP_{Avg}^{PC}$ values in lieu of a ranking of the edges in the decreasing order of their EBWC values. Also, two edges could be relatively compared with respect to $LP_{Avg}^{PC}$ and EBWC. For any two edges e₁ and e₂, if $LP_{Avg}^{PC}$(e₁) < $LP_{Avg}^{PC}$(e₂), we could say with a high probability (for example: 0.75 for the toy example graph in Sect. 2.5) that EBWC(e₁) ≥ EBWC(e₂), and vice-versa.
3. Targeted edge addition for evaluating community robustness: In (Tian and Moriano 2023), the authors observe that targeted edge addition (compared to random edge addition) effectively disrupts the ability of communities to stay distinct under perturbation. We claim that node pairs with negative $LP_{Avg}^{PC}$ scores could be preferred for targeted edge addition, as edges involving such node pairs are likely to be of larger EBWC and their inclusion could potentially bridge two different communities into one. Such targeted edge additions could then make it harder for the community detection algorithms to identify communities (in the perturbed networks) similar to the ones that existed before the edge additions.
4. Modular link prediction: The node pairs that incur the largest $LP_{Avg}^{PC}$ scores, but are not yet connected, could be predicted as the node pairs whose connection in the form of edges is very likely to increase the modularity of the communities identified by any community detection algorithm.
5. Small-world evolution: If we were to evolve a given network to a small-world network with the inclusion of a few more links, we could consider connecting the node pairs with the lowest $LP_{Avg}^{PC}$ scores; such edges are more likely to connect two non-overlapping sub graphs, or a node and a sub graph of the network, which would until then be reachable from each other only through multi-hop path(s) of length 2 or more.

The rest of the paper is organized as follows: Sect. 2 proposes our PCA/link prediction-based EBW classification model and illustrates the mechanism using a toy example graph. Section 3 presents the results of applying the proposed EBW classification model on a suite of 40 real-world networks and analyzes the performance results with respect to the five different measures mentioned earlier. Section 4 discusses related work that focuses on the use of edge betweenness for link prediction. Section 5 concludes the paper, identifies the limitations of the proposed work, and outlines plans for overcoming these limitations as well as potential extensions as part of future work.

2 PCA and link prediction-based binary classification model for edge betweenness

In this section, we present our proposed principal component analysis (PCA) and link prediction-based binary classification model to categorize edges as either low-betweenness or high-betweenness edges in a network. This section has six sub sections. We use a toy example graph (shown in Fig. 1, along with the edge betweenness centrality values) to illustrate the working of the proposed binary classification model for edge betweenness (referred to as the PCA_LP: EBW model). Section 2.1 introduces edge betweenness centrality (EBWC) and briefly describes the algorithmic procedure to determine the EBWC values for the edges in a graph. Section 2.2 introduces the neighborhood overlap (NOVER) edge centrality measure and the results of its correlation study with EBWC that motivated us to consider neighborhood-based link prediction techniques for developing an EBW binary classification model. Section 2.3 shows the computation of the link prediction scores for the node pairs in the toy example graph using the four different link prediction techniques (PAI, AAI, JAI and SAI) considered in this paper. Section 2.4 illustrates the execution of PCA on the link prediction dataset constructed for the toy example graph. Section 2.5 analyzes the magnitude and sign of the $LP_{Avg}^{PC}$ scores for the node pairs/edges vis-a-vis their betweenness in the graph. Section 2.6 presents an evaluation of the PCA_LP: EBW model for the toy example graph using five different metrics.

2.1 Edge betweenness (EBW)

The betweenness of an edge (simply referred to as Edge Betweenness, EBW) captures the contribution of the edge to shortest path communication (Freeman 1977; Girvan and Newman 2002) between any two nodes in the network. The betweenness of an edge is quantified in the form of a centrality metric, referred to as Edge Betweenness Centrality (EBWC) (Girvan and Newman 2002). The EBWC for an edge (Brandes 2001; Newman 2010) is the sum, over all node pairs in the network, of the fraction of the shortest paths between the node pair that go through the edge. For example: consider the edge (2, 7) in Fig. 1 and the node pair {1, 8}. There are two shortest paths (1–2–7–8 and 1–2–4–8) between nodes 1 and 8. Only one of these two shortest paths goes through edge (2, 7). Hence, the betweenness fraction for the edge (2, 7) with respect to the node pair {1, 8} is 1/2. Likewise, we could compute the betweenness fractions for the edge (2, 7) with respect to all the node pairs in the network and add them up to compute the EBWC for the edge (2, 7).

The EBWC metric cannot be computed just for a single edge or a selected set of edges in the network. It needs to be synchronously computed for all the edges in the network. The standard algorithm to compute the EBWC metric is based on Breadth First Search (BFS; Cormen et al. 2009) and is briefly explained below as a sequence of three steps:

Step 1: We run BFS starting from each node in the network and determine BFS trees rooted at each node; we make a note of the level numbers (distance of a node from the root node of the BFS tree) of the nodes in each of the BFS trees. The root of a BFS tree is said to be at level 0.
Step 2: In each of the BFS trees, we determine the number of shortest paths from the root node to all the nodes in the tree. The number of shortest paths from the root node to itself is 1. The number of shortest paths from the root node to a node i at level k (k > 0) in a BFS tree is the sum of the number of shortest paths from the root node to the predecessors of node i at level k−1.
Step 3: In each of the BFS trees, we simulate the process of sending one unit of information from every node (other than the root) to the root node of the BFS tree. The transmission starts from the node at the bottommost level in the BFS tree. A node at the bottommost level of the BFS tree divides its one unit of information proportionally depending on the number of immediate predecessors (nodes at the immediately earlier level) in the BFS tree and sends the fraction of information to these predecessor nodes. A node at level other than the bottommost level aggregates the flows received from all its immediate downstream nodes and adds its own one unit of information to the aggregate and proceeds to send the aggregated units of information proportionally (as described earlier) to its immediate predecessor nodes. The root node of a BFS tree accumulates the units of information received from its immediate downstream nodes and this would total to n−1 for a network of n nodes. We keep track of the amount of flow sent in each of the edges in the BFS trees. The EBWC for an edge is the sum of the flows sent on the edge across all the BFS trees divided by 2 (for an undirected graph).
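The three steps above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the edge list `TOY_EDGES` is inferred from the details of the toy example graph quoted in the text (Fig. 1 is not reproduced here), and the sketch assumes that each node splits its flow among its predecessors in proportion to the shortest-path counts from Step 2 (the usual Brandes-style weighting).

```python
from collections import deque

# Assumption: edge list inferred from the values quoted in the text
# for the 9-node toy example graph.
TOY_EDGES = [(1, 2), (1, 3), (1, 5), (1, 6), (1, 9), (3, 5), (3, 6),
             (5, 6), (5, 9), (2, 4), (2, 7), (4, 7), (4, 8), (7, 8)]

def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_betweenness(adj):
    """EBWC of every edge of an undirected, unweighted graph."""
    ebwc = {}
    for root in adj:
        # Steps 1 and 2: BFS levels and shortest-path counts from the root.
        level, paths, preds, order = {root: 0}, {root: 1}, {root: []}, []
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in level:
                    level[v], paths[v], preds[v] = level[u] + 1, 0, []
                    queue.append(v)
                if level[v] == level[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        # Step 3: every non-root node sends one unit toward the root,
        # starting from the bottommost level.
        flow = {u: 1.0 for u in order}
        flow[root] = 0.0
        for v in reversed(order):
            for u in preds[v]:
                share = flow[v] * paths[u] / paths[v]
                key = (min(u, v), max(u, v))
                ebwc[key] = ebwc.get(key, 0.0) + share
                flow[u] += share
    # Each node pair is counted from both end points in an undirected graph.
    return {edge: total / 2 for edge, total in ebwc.items()}

ebwc = edge_betweenness(build_adjacency(TOY_EDGES))
print(ebwc[(1, 2)], ebwc[(1, 9)], ebwc[(1, 5)], ebwc[(4, 7)])
```

Under the inferred edge list, the sketch gives 20.0, 6.0, 5.0 and 1.0 for the edges (1, 2), (1, 9), (1, 5) and (4, 7), which agrees with the EBWC values quoted for these edges in Sects. 2.5 and 2.6.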

Figure 2 illustrates the above three steps for two BFS trees in the toy example graph, one rooted at node 1 and another rooted at node 9. We have to do such computations for the BFS trees rooted at all the nodes in the network, add up the flows sent across the edges in each of the BFS trees and compute the EBWC of the edges. Figure 1 shows the EBWC values for all the edges in the toy example graph. The BFS algorithm is of time complexity Θ(n+m) for a graph of n nodes and m edges. As part of Step 1, we have to run the BFS algorithm a total of 'n' times, starting from each of the nodes, to determine the 'n' BFS trees. Further, we have to traverse the 'm' edges of the graph in each of Steps 2 and 3, leading to an overall time complexity of Θ(n*(n+m+m+m)) = Θ(n² + n*m). The EBWC algorithm (Girvan and Newman 2002; Brandes 2001) is thus computationally intensive. This motivated us to explore developing a binary classification model that would simply categorize the edges as high-EBW and low-EBW edges without going through the EBWC computations.

2.2 Theoretical basis and hypothesis

The motivation for us to consider neighborhood-based link prediction techniques (presented in Sect. 2.3) for developing an EBW binary classification model stems from observations made in earlier research (Meghanathan and Yang 2019) that explored the correlation between neighborhood overlap (NOVER, a local knowledge-based, asynchronously computable edge centrality metric) and EBWC for a suite of 47 real-world networks of diverse degree distributions. The NOVER score for an edge (u, v) is the ratio of the size of the intersection of the neighborhood sets of u and v to the size of the union of the neighborhood sets of u and v, excluding u and v themselves. We notice a moderate-strong negative rank-based correlation (measured through the Spearman's correlation coefficient) for several real-world networks. The medians of the Spearman's, Kendall's and Pearson's correlation coefficients for NOVER and EBWC were observed to be −0.715, −0.564 and −0.475 respectively.
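As a small illustration of the NOVER definition above, the following Python sketch computes the score for a few edges of the toy example graph. The function name `nover` and the edge list are assumptions of this sketch (the edge list is inferred from the details quoted in the text, since Fig. 1 is not reproduced here), and returning 0 for an empty union is a convention chosen for the sketch.

```python
# Assumption: edge list inferred from the values quoted in the text
# for the 9-node toy example graph.
TOY_EDGES = [(1, 2), (1, 3), (1, 5), (1, 6), (1, 9), (3, 5), (3, 6),
             (5, 6), (5, 9), (2, 4), (2, 7), (4, 7), (4, 8), (7, 8)]

def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def nover(adj, u, v):
    """Neighborhood overlap of the edge (u, v)."""
    common = adj[u] & adj[v]
    # The union excludes the end vertices u and v themselves.
    union = (adj[u] | adj[v]) - {u, v}
    # Sketch convention: an edge whose end vertices have no other
    # neighbors gets a score of 0.
    return len(common) / len(union) if union else 0.0

adj = build_adjacency(TOY_EDGES)
for edge in [(1, 2), (1, 5), (4, 7)]:
    print(edge, nover(adj, *edge))
```

For the inferred graph, the bridge edge (1, 2) has no shared neighbors (NOVER 0.0), whereas the edge (4, 7), whose end vertices share all their other neighbors, has NOVER 1.0; this is in line with the negative NOVER vs. EBWC relationship described above.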

A fundamental reasoning we attribute to the NOVER vs. EBWC negative correlation is that if the vertices adjacent to the end vertices (node pairs) of an edge (u, v) do not prefer to use the edge for shortest path communication (tends to happen if the node pair (u, v) incurs a larger NOVER score; i.e., the adjacent vertices of u and v are directly connected to each other and would not need to go through the edge (u, v) for shortest path communication), then the rest of the vertices in the network are not likely to use the edge for shortest path communication. Nevertheless, as the median values for the Kendall's and Pearson's NOVER vs. EBWC correlation coefficients are only moderate (around −0.50), we opine that the NOVER metric (whose values range from 0.0 to 1.0) alone could not be used as the basis to categorize edges as those of high-EBW and low-EBW classes.

In this research, we explore the use of a suite of link-level metrics that resonate with NOVER vis-a-vis their relationship to EBWC. Neighborhood-based link prediction techniques tend to highly score node pairs (i.e., vouch for these node pairs to be connected through edges) whose neighborhood is significantly shared (i.e., the node pairs incur a larger NOVER score) by both the nodes forming the pair, and vice-versa. The fundamentally negative relationship that appears to exist between NOVER and EBWC (validated in Meghanathan and Yang 2019 through correlation studies) forms the theoretical basis for us to explore the use of neighborhood-based link prediction techniques for developing a binary classification model for edge betweenness. The shared neighborhood-based link prediction scores for node pairs located within a community are likely to be larger than the link prediction scores for node pairs located between two different communities. On the other hand, the EBWC values for edges within a community are likely to be lower than the EBWC values for edges connecting two different communities (Girvan and Newman 2002), especially if the latter are bridge edges. We hypothesize that node pairs with larger link prediction scores (which would in turn incur larger NOVER scores), even if connected in the form of edges, are likely to incur a lower EBWC.

We propose to conduct principal component analysis (PCA) on a dataset of the neighborhood-based link prediction scores for the node pairs in a network and utilize the summation property of the principal components (i.e., the sum of the entries in a principal component is zero) to delineate the high-EBW edges (whose entries in the principal components would be negative, owing to their lower link prediction scores in the underlying dataset used to extract the principal components) from the low-EBW edges (whose entries in the principal components would be positive, owing to their relatively larger link prediction scores). As the notion of shared neighborhood is embedded in the formulation of the link prediction techniques, we do not use NOVER as one of the features of the underlying dataset on which we conduct PCA.

2.3 Link Prediction Techniques

We choose to use the computationally light, local neighborhood knowledge-based link prediction techniques (Bojanowski and Chrol 2020) rather than the computationally heavy, global knowledge-based link prediction techniques, though the latter could be slightly more accurate in their prediction (Bojanowski and Chrol 2020). Though typically run for node pairs not yet connected with an edge, the link prediction techniques could essentially quantify the extent to which there could be a link between any two nodes in the network (irrespective of whether or not the two nodes are already connected with an edge). We make use of this characteristic of the link prediction techniques and compute the link prediction scores for any two nodes in a network. We use the following four local knowledge-based link prediction techniques (the names of these techniques end with the word 'index' to indicate they seek to quantify the extent to which there could be a link between two nodes) to build the link prediction scores dataset (see Fig. 3) that will be subjected to PCA. Let N_u and N_v represent the sets of neighbors of vertices u and v.

(i) Preferential Attachment Index (PAI): The PAI score for a node pair (u, v) is computed as the product of their degrees |N_u| and |N_v|. Thus, PAI tends to prefer connecting two nodes of larger degrees.
$$ PAI(u,v) = \left| N_{u} \right| \cdot \left| N_{v} \right| $$
(ii) Adamic Adar Index (AAI): The AAI score for a node pair (u, v) is computed as follows, based on the degrees of their common neighbors. Per the formulation, AAI tends to prefer connecting two nodes that have common neighbors (the more the better), but these common neighbors should be of lower degree. If a node pair has no common neighbors, its AAI score is 0.
$$ AAI(u,v) = \sum_{w \in N_{u} \cap N_{v}} \frac{1}{\left| N_{w} \right|} $$
(iii) Jaccard Index (JAI): The JAI score for a node pair (u, v) with the sets of neighbors N_u and N_v is the ratio of the cardinality of the intersection of their neighborhood sets to the cardinality of the union of their neighborhood sets. Note that N_u $\cup$ N_v would exclude u and v.
$$ JAI(u,v) = \frac{\left| N_{u} \cap N_{v} \right|}{\left| N_{u} \cup N_{v} \right|} $$
(iv) Salton Index (SAI): The SAI score for a node pair (u, v) is the ratio of the cardinality of the intersection of their neighborhood sets to the square root of the product of their node degrees.
$$ SAI(u,v) = \frac{\left| N_{u} \cap N_{v} \right|}{\sqrt{\left| N_{u} \right| \cdot \left| N_{v} \right|}} $$
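The four formulas above can be turned into the n(n−1)/2 × 4 dataset with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the edge list is inferred from the details of the toy example graph quoted in the text, and the AAI term follows the formula exactly as written in this section (1/|N_w|), whereas the commonly cited form of the Adamic Adar index weights each common neighbor by 1/log|N_w|. Returning 0 when a denominator is 0 is a convention of the sketch.

```python
from itertools import combinations
from math import sqrt

# Assumption: edge list inferred from the values quoted in the text
# for the 9-node toy example graph.
TOY_EDGES = [(1, 2), (1, 3), (1, 5), (1, 6), (1, 9), (3, 5), (3, 6),
             (5, 6), (5, 9), (2, 4), (2, 7), (4, 7), (4, 8), (7, 8)]

def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def link_prediction_scores(adj, u, v):
    """Return (PAI, AAI, JAI, SAI) for the node pair (u, v)."""
    common = adj[u] & adj[v]
    union = (adj[u] | adj[v]) - {u, v}       # union excludes u and v
    degree_product = len(adj[u]) * len(adj[v])
    pai = float(degree_product)
    aai = sum(1.0 / len(adj[w]) for w in common)  # formula as written above
    jai = len(common) / len(union) if union else 0.0
    sai = len(common) / sqrt(degree_product) if degree_product else 0.0
    return pai, aai, jai, sai

def build_dataset(adj):
    """One record per node pair, whether or not the pair is an edge."""
    return {(u, v): link_prediction_scores(adj, u, v)
            for u, v in combinations(sorted(adj), 2)}

dataset = build_dataset(build_adjacency(TOY_EDGES))
print(len(dataset))        # number of node pairs
print(dataset[(2, 8)])     # a non-edge inside a dense sub graph
print(dataset[(4, 5)])     # a non-edge with no common neighbors
```

For the 9-node graph the dataset has 36 records; a pair such as (4, 5), with no common neighbors, scores 0 on AAI, JAI and SAI and is separated from the other pairs only by its PAI value.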

2.4 Principal component analysis of the link prediction scores dataset

We now subject the n(n−1)/2 × 4 link prediction scores dataset to principal component analysis (PCA (Jolliffe 2002)), where n is the number of nodes in the network and 4 is the number of features (the number of link prediction techniques) used to generate the dataset. Note that for a network of n nodes, the number of pairs of nodes is n(n−1)/2, which includes the 'm' edges (i.e., m ≤ n(n−1)/2) in the network as well. Figure 4 presents the pseudo code for executing PCA on the link prediction scores dataset and computing the PCs-based weighted average link prediction scores $LP_{Avg}^{PC}$ for each node pair. Typically, we observe the PC with the largest variance (referred to as the dominating PC) to capture at least 50% of the total variance across all the four PCs. Also, since the sum of the entries in a PC is 0, the entries (corresponding to the node pairs, in this case) in a PC are expected to include both positive and negative values. A node pair is likely to incur a positive entry in a PC, especially in the dominating PC, if the link prediction scores (feature values) for the node pair in the underlying dataset are larger, and vice-versa.
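A rough NumPy sketch of this computation is given below. It is not the pseudo code of Fig. 4 (which is not reproduced here), and it rests on several assumptions that the text does not spell out: the features are z-score standardized before PCA, the PCs are obtained from the eigen-decomposition of the covariance matrix of the standardized data, and each eigenvector (whose sign is mathematically arbitrary) is oriented so that its loadings sum to a non-negative value. The function name and the small input matrix are hypothetical.

```python
import numpy as np

def pca_weighted_scores(X):
    """Variance-weighted average of the PC entries of each record.

    X has one row per node pair and one column per link prediction index.
    Returns (scores, pc_variances), with the PCs sorted by decreasing variance.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                       # guard against constant features
    Z = (X - X.mean(axis=0)) / std            # assumption: z-score standardization
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]         # eigh returns ascending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumption: fix the arbitrary sign of each eigenvector so that larger
    # feature values tend to map to positive PC entries.
    signs = np.where(eigvecs.sum(axis=0) < 0, -1.0, 1.0)
    pcs = Z @ (eigvecs * signs)               # entries of the records in the PCs
    weights = eigvals / eigvals.sum()         # PC variances as weights
    return pcs @ weights, eigvals

# Hypothetical 6 x 4 dataset (columns: PAI, AAI, JAI, SAI), for illustration only.
X = np.array([[15.0, 0.00, 0.00, 0.00],
              [ 6.0, 0.67, 0.67, 0.82],
              [12.0, 0.00, 0.00, 0.00],
              [ 9.0, 0.58, 0.50, 0.67],
              [20.0, 0.78, 0.50, 0.67],
              [ 4.0, 0.33, 0.33, 0.50]])
scores, variances = pca_weighted_scores(X)
print(np.round(scores, 2), np.round(variances / variances.sum(), 2))
```

Because every PC is a linear combination of mean-centered features, the entries of each PC (and hence the weighted average scores) sum to zero, which is the property the classification by sign relies on. The sign convention matters for that classification, so a different orientation rule would need to be checked against the intended interpretation.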

2.5 Analysis of the $LP_{Avg}^{PC}$ scores for the node pairs/edges

Figure 5 presents the $LP_{Avg}^{PC}$ scores for all the 36 node pairs of the 9-node toy example graph. We observe the categorization of the edges to one of the two classes (low-EBW or high-EBW) to accurately reflect the topological position of the edges. For example, the edge (1, 2) is a bridge edge connecting the two otherwise disconnected sub graphs (1, 3, 5, 6, 9) and (2, 4, 7, 8); it rightly earns an $LP_{Avg}^{PC}$ score of −1.47, reflective of the highest EBWC of 20.0 (see Fig. 1) incurred for this edge. If we were to remove this bridge edge, the graph would get disconnected into two components (communities): a strategy adopted by certain community detection algorithms, like the Girvan-Newman algorithm (Girvan and Newman 2002). On the other hand, edges (1, 5) and (4, 7) incurred the largest $LP_{Avg}^{PC}$ scores of 2.23/2.24, reflective of their position well within the sub graphs as well as a relatively larger degree for their end vertices (compared to the rest of the vertices) within their respective sub graphs; the EBWC values for edges (1, 5) and (4, 7) are 5.0 and 1.0 respectively, much lower than that of the bridge edge (1, 2). In Fig. 5, we highlight the node pairs that are also the edges in the graph: the edges with negative $LP_{Avg}^{PC}$ scores are highlighted in green and categorized as those of high edge betweenness (class: high-EBW), whereas the edges with positive $LP_{Avg}^{PC}$ scores are highlighted in pink and categorized as those of low edge betweenness (class: low-EBW). Records that are not highlighted in Fig. 5 correspond to node pairs that are not connected by an edge in the graph.

While analyzing the $LP_{Avg}^{PC}$ scores for node pairs that are not yet connected in the form of edges, we observe node pair (2, 8) to incur the largest $LP_{Avg}^{PC}$ score of 2.31, as the inclusion of the edge (2, 8) would make the sub graph (2, 4, 7, 8) a 4-vertex clique: a perfect scenario for increasing the modularity score of the community (2, 4, 7, 8). Likewise, the inclusion of edges for node pairs (3, 9) and (6, 9), each with an $LP_{Avg}^{PC}$ score of 2.07, would bring node 9 into a denser sub graph along with nodes 1, 3, 5 and 6; the inclusion of edges (3, 9) and (6, 9) could only increase the modularity of the community (1, 3, 5, 6, 9). The above observations corroborate our earlier assertion that the high $LP_{Avg}^{PC}$ node pairs could be considered as candidates for edge inclusion in pursuit of modular link prediction.

On the other hand, if we were to evolve the toy example graph of Fig. 1 to a small-world network, we need to connect two far away nodes with an edge so that the average shortest path length between any two nodes in the network would reduce (a characteristic of small-world networks). We observe several node pairs (5, 4), (5, 7), (1, 8), (3, 4), (3, 7), (6, 4) and (6, 7) to be candidates (their $LP_{Avg}^{PC}$ scores are in a low, narrow range of −1.38 to −1.28) for inclusion to transform the toy example graph to a small-world network. If we were to break the tie using the $LP_{Avg}^{PC}$ scores for the node pairs, we would go for either (5, 4) or (5, 7), both incurring the lowest $LP_{Avg}^{PC}$ score of −1.38 among the node pairs not yet connected by an edge in the toy example graph. Inclusion of either of these two edges would reduce the hop counts of the shortest paths from node 8 in the sub graph (2, 4, 7, 8) to all the 5 vertices in the sub graph (1, 3, 5, 6, 9), justifying the utility of the node pairs with the lowest $LP_{Avg}^{PC}$ scores for inclusion to evolve a given network to a small-world network. Note that there is no need to run any shortest path algorithm to identify such node pairs. The absence of common neighbors contributes to low link prediction scores for these node pairs for three (AAI, JAI and SAI) of the four link prediction techniques used in this paper, and their $LP_{Avg}^{PC}$ scores accordingly get lower/negative.
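The use of the scores discussed in this sub section (classifying edges by sign, and ranking the unconnected node pairs) can be sketched as below. The helper name `classify_and_rank` is hypothetical, the `scores` dictionary contains only the $LP_{Avg}^{PC}$ values quoted in the text above rather than the full contents of Fig. 5, and treating a score of exactly zero as low-EBW is an assumption, since the text only discusses negative and positive scores.

```python
def classify_and_rank(scores, edges):
    """Split edges by the sign of their score and rank the non-edges.

    scores: {(u, v): LP_Avg^PC score}; edges: iterable of (u, v) pairs.
    Returns (high_ebw_edges, low_ebw_edges, non_edges_sorted_ascending).
    """
    edge_set = {tuple(sorted(e)) for e in edges}
    high_ebw, low_ebw, non_edges = [], [], []
    for pair, score in scores.items():
        key = tuple(sorted(pair))
        if key in edge_set:
            # Negative score -> high-EBW class; otherwise low-EBW class
            # (placing a zero score in low-EBW is an assumption of this sketch).
            (high_ebw if score < 0 else low_ebw).append(key)
        else:
            non_edges.append((score, key))
    non_edges.sort()
    return high_ebw, low_ebw, non_edges

# Only the scores quoted in the text; the remaining records are omitted.
scores = {(1, 2): -1.47, (2, 8): 2.31, (3, 9): 2.07,
          (6, 9): 2.07, (5, 4): -1.38, (5, 7): -1.38}
high, low, ranked = classify_and_rank(scores, edges=[(1, 2)])

print("high-EBW edges:", high)
print("small-world candidates (lowest scores):", ranked[:2])
print("modular link prediction candidates (highest scores):", ranked[-3:])
```

The lowest-scoring unconnected pairs are at the front of the ranked list and the highest-scoring ones at the back, so both uses come from a single sort and no shortest path computation is involved.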

2.6 Evaluation of the PCA_LP: EBW binary classification model

We evaluate the performance/effectiveness of the PCA_LP: EBW binary classification model through five different metrics. The calculation of these metrics and the assessment of the effectiveness of the model are illustrated in this sub section through the EBWC values and $LP_{Avg}^{PC}$ scores observed for the toy example graph.

(i)
Fraction of edges in the high-EBW class: The fraction of the edges predicted to be in the high-EBW class is calculated as the ratio of the number of edges predicted to be in the high-EBW class to the sum of the number of edges predicted to be in the high-EBW class and low-EBW class. We hypothesize this to be an influential metric that could impact the values for one or more of the other performance metrics (see the logarithmic curve fit in Fig. 11 for the real-world networks and the ensuing explanation). For the toy example graph, the number of edges predicted to be in the high-EBW class and low-EBW class are respectively 2 and 12, resulting in the fraction of edges predicted to be in the high-EBW class to be 2/(2 + 12) = 0.1428.
(ii)
Average EBWC ratio: We calculate the average EBWC values for the edges categorized in the class: high-EBW and class low-EBW. We then compute the ratio of these average EBWC values for the high-EBW class to the low-EBW class. We expect the ratio to be appreciably greater than 1.0 for the binary classification model to be considered effective in what it is supposed to do. For the toy example graph of Fig. 1, the average EBWC value for the high-EBW edges (1, 2) and (1, 9) is (20.0 + 6.0)/2 = 13.0; the average EBWC value for the low-EBW edges is the average of EBWC values of the remaining 12 edges: 4.08. Hence, the Average EBWC ratio incurred by the PCA_LP: EBW binary classification model for the toy example graph is 13.0/4.08 = 3.18, much greater than 1.0, exemplifying the effectiveness of the model.
(iii)
EBW class concordance index: In order for the PCA_LP: EBW binary classification model to be considered 100% concordant with respect to the classification, we expect the EBWC value for every edge categorized to be in the high-EBW class to be greater than or equal to the EBWC values of the edges categorized to be in the low-EBW class. We identify the pairs of edges in the two classes that are concordant, per the above criterion and compute the EBW class concordance index as the fraction (ratio) of the number of concordant pairs of edges in the two classes to the total number of pairs of edges in the two classes. For the toy example graph, we observe the EBWC of edge (1, 2) in the high-EBW class to be greater than the EBWC values of all the 12 edges in the low-EBW class, accounting for 1*12 = 12 concordant pairs of edges; plus, the EBWC of edge (1, 9) in the high-EBW class is greater than the EBWC values of 10 of the 12 edges in the low-EBW class, totaling to 12 + 10 = 22 concordant pairs of edges. The total number of pairs of edges across the two classes is (2 edges in the high-EBW class) * (12 edges in the low-EBW class) = 24 pairs of edges. Hence, the EBW class concordance index for the toy example graph is 22/24 = 0.92.
(iv)
Spearman's $LP_{Avg}^{PC}$ vs. EBWC rank-based correlation coefficient: As edges incurring negative $LP_{Avg}^{PC}$ values are categorized to the high-EBW class and edges incurring positive $LP_{Avg}^{PC}$ values are categorized to the low-EBW class, we expect an inverse correlation between the $LP_{Avg}^{PC}$ values of the edges computed per the PCA_LP: EBW model and their actual EBWC values. Figure 6 illustrates the computation of the Spearman's rank-based correlation coefficient (Strang 2006) between these two measures. For both the measures, we assign tentative rank numbers in the increasing order of the values (the smallest value gets the lowest rank number). The final rank number for competing edges (that have a tie in their values with respect to a measure) is the average of their corresponding tentative rank numbers. The formula to compute the rank-based correlation coefficient is $1 - \frac{6\sum d^{2}}{m(m^{2} - 1)}$, where m is the number of edges, d is the difference in the final rank numbers of an edge with respect to the $LP_{Avg}^{PC}$ values and the EBWC values, and the sum runs over all the edges. From Fig. 6, $\sum d^{2}$ = 773.5. With m = 14 edges, the Spearman's rank-based correlation coefficient is $1 - \frac{6*773.5}{14(14^{2} - 1)} = -0.70$, indicative of a strong negative correlation between the $LP_{Avg}^{PC}$ values and the EBWC measure.
Fig. 6
Computation of the Spearman's Rank-based Correlation Coefficient between the $LP_{Avg}^{PC}$ Scores for the Edges and their EBWC Values
(v)
$LP_{Avg}^{PC}$ vs. EBWC discordant probability: For any two edges e₁ and e₂, with a very high probability (referred to as the discordant probability), we expect the $LP_{Avg}^{PC}$ and EBWC values of the two edges to be inversely related (a measure of discordance) rather than similarly related (a measure of concordance). We propose to count the number of concordant and discordant pairs of edges in the network and measure the discordant probability as the fraction of the discordant pairs of edges to the total number of pairs of edges. The criteria for counting the concordant and discordant pairs of edges are as follows:
- if ($LP_{Avg}^{PC}$(e₁) < $LP_{Avg}^{PC}$(e₂) and EBWC(e₁) < EBWC(e₂)): the two edges are concordant
- else if ($LP_{Avg}^{PC}$(e₁) > $LP_{Avg}^{PC}$(e₂) and EBWC(e₁) > EBWC(e₂)): the two edges are concordant
- else if ($LP_{Avg}^{PC}$(e₁) == $LP_{Avg}^{PC}$(e₂) and EBWC(e₁) == EBWC(e₂)): the two edges are concordant
- else if ($LP_{Avg}^{PC}$(e₁) < $LP_{Avg}^{PC}$(e₂) and EBWC(e₁) ≥ EBWC(e₂)): the two edges are discordant
- else if ($LP_{Avg}^{PC}$(e₁) ≥ $LP_{Avg}^{PC}$(e₂) and EBWC(e₁) < EBWC(e₂)): the two edges are discordant

Since there are a total of 14*(14–1)/2 = 91 pairs of edges that need to be compared to count the number of concordant and discordant pairs, we simply report the final result here, rather than presenting the pair-wise details. Of the 91 pairs of edges, we find 22 concordant pairs and 69 discordant pairs. We observe about 75% (discordant probability = 69/91 ~ 0.75) of the edge pairs to be discordant with respect to the two measures $LP_{Avg}^{PC}$ and EBWC, vindicating our earlier assertion of a strong negative correlation between them. That is, for the toy example graph, if the $LP_{Avg}^{PC}$ of an edge e₁ is greater than the $LP_{Avg}^{PC}$ of an edge e₂, then it is more likely (with a probability of about 0.75) that the EBWC of edge e₁ is less than or equal to the EBWC of edge e₂, and vice-versa.
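The two rank/order-based metrics, (iv) and (v), can be sketched as follows. This is a minimal illustration and not the author's implementation: the function names and the five-edge score lists are hypothetical (the per-edge values of Fig. 6 are not reproduced here), and the Spearman computation uses the simplified rank-difference formula quoted in the text together with tie-averaged ranks. A pair of edges is treated as concordant when the two differences have the same sign (or are both zero), and as discordant otherwise, which is how we read the criteria listed above.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of tentative ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_from_sum_d2(sum_d2, m):
    """Formula used in the text: 1 - 6*sum(d^2) / (m*(m^2 - 1))."""
    return 1 - 6 * sum_d2 / (m * (m * m - 1))

def spearman(x, y):
    """Spearman coefficient of two equally long lists via tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return spearman_from_sum_d2(sum_d2, len(x))

def discordant_probability(lp, ebwc):
    """Fraction of edge pairs that are not concordant under the listed criteria."""
    def sign(z):
        return (z > 0) - (z < 0)
    pairs = list(combinations(range(len(lp)), 2))
    concordant = sum(
        1 for i, j in pairs if sign(lp[i] - lp[j]) == sign(ebwc[i] - ebwc[j])
    )
    return (len(pairs) - concordant) / len(pairs)

# Hypothetical scores for five edges (illustrative only, not taken from Fig. 6).
lp_scores = [-1.2, -0.4, 0.3, 0.3, 0.9]
ebwc_values = [20.0, 4.0, 6.0, 6.0, 2.5]
rho = spearman(lp_scores, ebwc_values)
p_disc = discordant_probability(lp_scores, ebwc_values)
# Values reported in the text for the toy graph: sum(d^2) = 773.5 with m = 14 edges.
rho_text = spearman_from_sum_d2(773.5, 14)
```

With the numbers reported in the text, `spearman_from_sum_d2(773.5, 14)` evaluates to about −0.70, matching the coefficient stated for the toy example graph.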

2.7 Evaluation on a toy example graph without bridge edges

We now present an evaluation of the proposed PCA_LP: EBW binary classification model on a toy example graph (see Fig. 7) that does not have any bridge edges (unlike the example graph of Fig. 1, wherein the edge (1, 2) is a bridge edge), but has one or more hub nodes. Node 5 (of degree 5) serves as the high-degree hub node without which the graph would disconnect into two components. Figure 7 presents the EBWC and $LP_{Avg}^{PC}$ scores (calculated on the basis of the four link prediction techniques mentioned in Sect. 2.2) for all the edges in the graph. Per the PCA_LP: EBW model, 6 edges (with a negative $LP_{Avg}^{PC}$ score) of a total of 16 edges get classified as high-EBW edges (highlighted in green in Fig. 7) and the other 10 edges (with a positive $LP_{Avg}^{PC}$ score, highlighted in pink in Fig. 7) get classified as low-EBW edges. The list of edges in Fig. 7 for each of the two classes is sorted in the decreasing order of the EBWC values to facilitate a visual/easier calculation of some of the performance metrics proposed in Sect. 2.6.

The fraction of the edges in the High-EBW class is 6/16 = 0.375 (more than the fraction of High-EBW edges observed for the graph in Fig. 1 with a bridge edge). Nevertheless, three of the six High-EBW edges in Fig. 7 are those connected with the hub node (node 5); a similar observation could also be made for the toy example graph of Fig. 1, wherein both the High-EBW edges, (1, 2) and (1, 9), are incident on the high-degree hub node (node 1). The average EBWC values for the edges in the High-EBW and Low-EBW classes are 7.47 and 4.02 respectively, resulting in an average EBWC ratio of 7.47/4.02 = 1.86. We observe a total of 50 pairs of edges (out of 6*10 = 60 pairs of edges) across the two classes to be concordant: i.e., the EBWC of an edge in the High-EBW class is actually greater than the EBWC of an edge in the Low-EBW class. The EBW class concordance index is thus 50/60 = 0.83. Figure 7 also displays the distribution of the final rank of the edges with respect to EBWC and the $LP_{Avg}^{PC}$ scores: we observe a negative correlation between these two measures (i.e., the larger the final rank of an edge with respect to one metric, the lower the final rank of the edge with respect to the other metric, and vice-versa), and the Spearman's rank-based correlation coefficient is −0.70. Of the 16*(16–1)/2 = 120 pairs of edges possible to be considered for pair-wise comparison (see metric (v) in Sect. 2.6) with respect to EBWC and the $LP_{Avg}^{PC}$ scores, we observe 32 pairs to be concordant and 88 pairs to be discordant, resulting in a discordant probability of 88/120 ~ 0.73, vindicating a strong negative correlation between these two measures.
To summarize, for graphs without bridge edge(s), a relatively larger fraction of edges appears to get classified as High-EBW edges, and the average EBWC ratio between the two classes of edges could be relatively lower than that of graphs with a bridge edge. Nevertheless, we observe the proposed PCA_LP: EBW binary classification model to be comparably effective (the values for the Spearman's rank-based correlation coefficient: −0.70 and −0.70, the EBW class concordance index: 0.92 and 0.83, and the discordant probability: 0.75 and 0.73 are comparable for the graphs in Figs. 1 and 7) in identifying edges with a relatively larger EBWC and tagging them as High-EBW edges for graphs without any bridge edge as well.
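The three class-level metrics (fraction of edges in the high-EBW class, average EBWC ratio and EBW class concordance index) can be computed directly from the sign of the $LP_{Avg}^{PC}$ scores. The sketch below is illustrative only: the function names and the six-edge data are hypothetical (they are not the values of Fig. 1 or Fig. 7), a score of exactly zero is assumed to fall in the low-EBW class because the text does not cover that case, and the concordance test uses the "greater than or equal to" criterion from the definition of metric (iii).

```python
def classify_by_sign(lp_scores):
    """Split edge indices: negative score -> high-EBW class, otherwise low-EBW.

    Assumption: a score of exactly 0 is placed in the low-EBW class.
    """
    high = [i for i, s in enumerate(lp_scores) if s < 0]
    low = [i for i, s in enumerate(lp_scores) if s >= 0]
    return high, low

def class_metrics(lp_scores, ebwc):
    """Return metrics (i), (ii) and (iii) for one network.

    Both classes are assumed to be non-empty.
    """
    high, low = classify_by_sign(lp_scores)
    avg_high = sum(ebwc[i] for i in high) / len(high)
    avg_low = sum(ebwc[i] for i in low) / len(low)
    # A (high, low) pair is concordant if the high-EBW edge has the larger or equal EBWC.
    concordant = sum(1 for i in high for j in low if ebwc[i] >= ebwc[j])
    return {
        "fraction_high": len(high) / (len(high) + len(low)),
        "avg_ebwc_ratio": avg_high / avg_low,
        "class_concordance": concordant / (len(high) * len(low)),
    }

# Hypothetical six-edge example (illustrative values only).
lp_scores = [-1.1, -0.2, 0.4, 0.6, 0.8, 1.0]
ebwc = [12.0, 4.0, 5.0, 3.0, 2.0, 2.0]
metrics = class_metrics(lp_scores, ebwc)
```

For this hypothetical input, two of the six edges land in the high-EBW class, and seven of the 2*4 = 8 cross-class pairs are concordant.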

3 Evaluation for real-world networks

In this section, we run the PCA_LP: EBW binary classification model on a suite of 40 real-world networks spread over diverse domains and whose degree distributions range from those of random networks (Erdos and Renyi 1959) to scale-free networks (Albert and Barabasi 2002). The number of nodes and edges (see Fig. 8) in these networks range over [24, …, 461] and [38, …, 5972], with a median of 105 nodes and 507 edges respectively. The spectral radius ratio for node degree ($\lambda_{k}^{sp}$ ≥ 1.0; Meghanathan 2014), which is a measure of the variation in node degree, ranges from 1.01 to 3.81, with a median of 1.48. Random networks exhibit a significantly lower $\lambda_{k}^{sp}$ (in the vicinity of 1.0 to 1.2; Meghanathan 2017) compared to the scale-free networks.

Figure 8 presents the number of nodes and edges as well as the $\lambda_{k}^{sp}$ values of the real-world networks (net #s 1–40) listed in Figs. 9 and 10. Figures 9 (for net #s 1–20) and 10 (for net #s 21–40) present the names of the 40 real-world networks, the number of edges predicted to be in the high-EBW class and low-EBW class as well as the values incurred for the five performance metrics (introduced in Sect. 2.6) per the PCA_LP: EBW model. For each of the five performance metrics, we have presented a heat map visualization of the values considered together across both Figs. 9 and 10.

Barring four real-world networks (net #s 2, 3, 10 and 31), the number of edges predicted to be in the low-EBW class is typically much larger than the number of edges predicted to be in the high-EBW class. The fraction of edges in the high-EBW class compared to the total number of edges ranges from 0.03 to 0.96, with a median of 0.20. Nonetheless, the average EBWC ratio (calculated as the ratio of the average EBWC of the edges in the high-EBW class and the average EBWC of the edges in the low-EBW class) is greater than 1.0 for all the real-world networks; the average EBWC ratio values range from 1.27 to 6.02, with a median of 2.35.

An interesting observation is that as the fraction of edges predicted to be in the high-EBW class increases from 0 to 1, the average EBWC ratio of the edges in the high-EBW class vs. the low-EBW class decreases (see Fig. 11). The decrease appears to follow a logarithmic trend. If we were to fit a logarithmic function for the distribution shown in Fig. 11, we get: the Average EBWC ratio = −0.8524 * ln(fraction of edges predicted to be in the high-EBW class) + 1.1656, with an R² value of 0.41. The decrease is also confirmed through the heat map visualization enabled for the two metrics in Figs. 9 and 10: the more reddish (lower values) the entries for the fraction of edges in the high-EBW class, the greener (larger values) the corresponding entries in the Average EBWC ratio column; whereas, the more greenish the entries for the fraction of edges in the high-EBW class, the lighter the colors of the cells that show the corresponding values for the Average EBWC ratio.

Note that the fraction of edges predicted to be in the high-EBW class is calculated based on the number of edges incurring negative $LP_{Avg}^{PC}$ and positive $LP_{Avg}^{PC}$ scores per the PCA_LP: EBW model. Hence, if one were to use our proposed model for a test real-world network (other than the 40 real-world networks used here), one could obtain the value for the fraction of edges in the high-EBW class by simply conducting PCA on the link prediction scores dataset for the test real-world network. One could then substitute the value for the fraction in the above logarithmic curve (fitted based on the results incurred for the 40 real-world networks) and predict the average EBWC ratio for the high-EBW edges vs. the low-EBW edges for the test real-world network without calculating the actual EBWC values for the edges in that network. If we were to do such a testing with the toy example graphs of Figs. 1 and 7 in Sect. 2, we get the fraction of edges in the high-EBW class to be 2/14 = 0.1428 and 6/16 = 0.375 respectively. Upon substituting these fraction values in the above logarithmic curve equation, we predict the average EBWC ratio for the toy example graphs of Figs. 1 and 7 to be respectively −0.8524 * ln(0.1428) + 1.1656 = 2.82 and −0.8524 * ln(0.375) + 1.1656 = 2.00, which are reasonably close to the actual average EBWC ratios of 3.18 and 1.86 respectively reported for the toy example graphs of Figs. 1 and 7 in Sects. 2.6 and 2.7.
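The substitution described above amounts to evaluating one logarithmic expression. The following sketch uses the fitted coefficients reported in the text; the function and constant names are hypothetical, and since the fit has an R² value of only 0.41, the output is best read as a rough estimate rather than a precise prediction.

```python
import math

# Coefficients of the logarithmic fit reported in the text for the 40 networks.
SLOPE = -0.8524
INTERCEPT = 1.1656

def predicted_avg_ebwc_ratio(fraction_high):
    """Estimate the average EBWC ratio from the fraction of high-EBW edges."""
    if not 0.0 < fraction_high <= 1.0:
        raise ValueError("fraction_high must lie in (0, 1]")
    return SLOPE * math.log(fraction_high) + INTERCEPT

fig1_prediction = predicted_avg_ebwc_ratio(2 / 14)  # toy example graph of Fig. 1
fig7_prediction = predicted_avg_ebwc_ratio(6 / 16)  # toy example graph of Fig. 7
```

Rounded to two decimals, the two estimates are 2.82 and 2.00, against the actual ratios of 3.18 and 1.86 reported for the two toy example graphs.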

The EBW class concordance index for the 40 real-world networks ranges from 0.68 to 0.98, with a median of 0.85. Such high numbers for this metric corroborate our assertion that most of the edges predicted to be in the high-EBW class do incur (actual) EBWC values that are greater than the EBWC values of the edges predicted to be in the low-EBW class. We do not observe any dependence of the EBW class concordance index on the fraction of edges predicted to be in the high-EBW class. Though the EBW class concordance index could technically be in the range of [0, …, 1] and the heat map coloring for this metric in Figs. 9 and 10 is based on this range of [0, …, 1], we observe the cells recording the values for this metric to be typically greener in color, confirming the larger, narrow range of values observed for this metric vis-a-vis the other performance metrics.

The global knowledge-based Spearman's rank correlation coefficient (calculated based on the $LP_{Avg}^{PC}$ scores and the EBWC values for all the edges in the network) ranges from −0.96 to −0.40, with a median of −0.62. All the 40 real-world networks exhibit an inverse correlation between these two measures, which corroborates our hypothesis that link prediction techniques tend to highly score edges incurring lower EBWC values and vice-versa. The values incurred for the relative knowledge-based discordant probability (calculated based on the $LP_{Avg}^{PC}$ scores and EBWC values for any two edges in the network) further reaffirm the validity of our hypothesis. We observe the discordant probability values for the real-world networks to range from 0.64 to 0.92, with a median of 0.72. This implies that if we were to pick any two edges in any of the 40 real-world networks, there is a probability of at least 0.64 (almost 2/3) that if the $LP_{Avg}^{PC}$ score for edge e₁ is greater than the $LP_{Avg}^{PC}$ score for edge e₂, then the actual EBWC for edge e₁ would be less than or equal to the actual EBWC value for edge e₂ and vice-versa. This is a significant contribution of our research to the literature.

From Fig. 12, we also observe that the larger the relative knowledge-based discordant probability, the more negative the value of the global knowledge-based Spearman's rank-based correlation coefficient. This is as expected, since the more prominent the discordant trend observed between the two metrics for any two edges in the network, the stronger should be the inverse relationship observed between the two metrics for all the edges in the network. Nevertheless, the two metrics are simply linearly related, with a Pearson's correlation coefficient as high as 0.96.

With respect to the impact of the network parameters on the five performance metrics, we observe the spectral radius ratio for node degree ($\lambda_{k}^{sp}$) and the link density (ρ_link, calculated as the fraction (ratio) of the actual number of links to the maximum possible number of links between any two nodes in the network) to impact the average EBWC ratio, though only to a certain extent. While $\lambda_{k}^{sp}$ positively impacts the average EBWC ratio, ρ_link negatively impacts the average EBWC ratio. The distributions in Fig. 13 show that real-world networks with larger $\lambda_{k}^{sp}$ values and lower ρ_link values are likely to exhibit a larger proportion of difference in the EBWC values of the edges predicted to be in the high-EBW class vis-a-vis the low-EBW class. The values for the other four performance metrics incurred for the 40 real-world networks appear to be independent of $\lambda_{k}^{sp}$ and ρ_link.

Using the median $\lambda_{k}^{sp}$ value of 1.48 (for the 40 real-world networks) as the cutoff, we observe 2.05 to be the median of the average EBWC ratio values for the 21 real-world networks with $\lambda_{k}^{sp}$ values less than or equal to 1.48; whereas, 2.86 is the median of the average EBWC ratio values for the remaining 19 real-world networks with $\lambda_{k}^{sp}$ values greater than 1.48. On the other hand, by using the median ρ_link value of 0.112 (for the 40 real-world networks) as the cutoff, we observe 2.67 as the median of the average EBWC ratio values for the 21 real-world networks with ρ_link values less than or equal to 0.112; whereas, 2.03 is the median of the average EBWC ratio values for the remaining 19 real-world networks with ρ_link values greater than 0.112.

4 Related work

Link prediction is a critical concept in Network Science to analyze the growth and evolution of complex real-world networks. Link prediction techniques either use the local neighborhood knowledge or the global knowledge of the entire network. The local neighborhood knowledge-based techniques are preferred for their low computational complexity and, at the same time, good accuracy that is comparable to that obtained with global knowledge-based methods (Kumar et al. 2020). Very few works have used edge betweenness centrality (EBWC) in the context of link prediction. Among these, Saxena et al. (2023) propose to build a training set of edges through two approaches: half of the training set will be randomly chosen edges and the other half of the training set will comprise edges of larger EBWC. Such a hybrid training set of edges is passed on to a Graph Convolution Network (GCN) for modeling and link prediction. In (Sulaimany and Amini 2021), the authors propose to compute a betweenness-based link prediction score for a node pair by first computing the EBWC of the edges in the network with the inclusion of an edge between the pair of nodes and then multiplying the inverse of the EBWC of the proposed edge with the link prediction score of the neighborhood-based technique to obtain a final ranking score. The node pair that incurs the largest value for the final ranking score is chosen for inclusion. However, this approach requires the EBWC of the edges to be repeatedly computed for each node pair that is a candidate for inclusion. In (Zhang et al. 2023), the authors observed that the use of link value (like 'EBWC') coupled with the link prediction scores to train a Graph Neural Network (GNN) outperforms baseline GNNs that are trained without any link value.

The triangle closure principle is widely believed to drive the formulation of different link prediction techniques (Meghanathan 2023b), especially the local common neighborhood-based ones, for social networks. However, in Yu and Wu (2022), the authors advocate the use of a quadrilateral closure principle wherein nodes separated by shortest paths of length 3 are preferably connectible for certain social networks. They further extend this idea by proposing the use of a random walk-based diffusion to formulate the link prediction scores for any two nodes in the network. This way, the neighborhood information from nodes separated at distances greater than 2 as well are taken into account, albeit with different weights, and not completely ignored.

In (Gao et al. 2015), the authors show strong positive or negative correlations between the prediction accuracy of a suite of local neighborhood and global knowledge-based link prediction techniques vs. the network-level metrics (such as the global clustering coefficient, average degree, average shortest path length, Gini index, etc.) measured for the networks under study. For instance, the accuracy of prediction per the Preferential Attachment Index (PAI) is observed to exhibit a strong positive correlation with the Gini index metric; this is because, the Gini index metric (in a scale of 0 to 1) is likely to be high for scale-free networks that exhibit a relatively larger variation in node degree (networks with dominant hub nodes that have a very large degree compared to the rest of the nodes in the network). For such high $\lambda_{k}^{sp}$ scale-free networks, the PAI link prediction technique was observed in Gao et al. (2015) to yield a relatively higher accuracy than the other methods; on the other hand, the PAI technique performed poorly for real-world networks whose degree distribution exhibited a regular pattern (i.e., $\lambda_{k}^{sp}$ closer to 1.0; the degrees of the nodes are comparable to each other).

In (Kerrache et al. 2020), the authors propose a link prediction model that considers three different measures (that may even have a tradeoff amongst themselves): similarity of the nodes (nodes that are closer to each other are considered similar); popularity of the nodes (typically, a function of node degree; but, could be based on any node-level metric) and the degrees of the common neighborhood of the two nodes. The incorporation of the popularity measure in the formulation of the link prediction score facilitates connecting two far away popular nodes rather than two similar (closer) unpopular nodes. In a similar vein, the work in Naik et al. (2023) proposes to consider the sentiment similarity of users in social networks along with the community structures they are in as well as the traditional topological features of the network.

In (Nandini et al. 2024), the authors propose an enhancement to the traditional common neighborhood-based link prediction techniques by counting only those common neighbors whose centrality values (per different centrality metrics like degree, betweenness, closeness, clustering coefficient, etc.) are above the average centrality value (for the particular metric) of the nodes in the entire network. However, this approach requires the computation of the global knowledge-based centrality metrics (like closeness and betweenness) even if one is using only the local common neighborhood-based link prediction techniques. A parameterized model to weigh in the common neighborhood-based link prediction scores and the centrality values of the vertices was proposed in Ahmad et al. (2020); experimental evaluation on different real-world networks suggested a larger weight (0.7 or 0.8 in a scale of 0 to 1) to be assigned to the common neighborhood-based link prediction scores compared to the centrality metrics. In a similar vein, a weighted link prediction score is proposed in Ayoub et al. (2023) involving a node similarity metric and the betweenness centrality metric. While a support vector machine was used as the classifier in Ahmad et al. (2020), a graph neural network was used as the classifier in Ayoub et al. (2023). For effective temporal link prediction (i.e., predicting a link in a future time instant), in a recent work (He et al. 2024), the authors propose building a layered dataset comprising the features that represent the link prediction scores for the node pairs in static snapshots of the network in the recent past (each layer is a snapshot taken at a particular time). One would then build a machine learning model comprising a stack of the feature vectors from the consecutive temporal layers to predict links in the target layer that would correspond to the future time instant.

In (Kumari et al. 2022), the authors ran a suite of community detection algorithms on a given network to identify the different communities (dense sub graphs) of nodes. For two nodes that are not yet connected with a link, the chances of a link formation between them is considered higher if the two nodes are in the same community and lower, if they are in a different community. However, this approach requires running computationally-intensive community detection algorithms a priori before assigning link prediction scores. In (Singh et al. 2020), information diffusion between communities was proposed as the basis for identifying target links that could be retained as well as those links that could be removed from the network.

Recently (Gui 2024), the authors propose to first compute the minimum three non-trivial Eigenvectors (whose entries correspond to the individual nodes) of the Laplacian matrix of a graph (Fiedler 1973), and build a supervised dataset of n(n−1)/2 records with seven attributes/features using these vectors, where n(n−1)/2 corresponds to the number of node pairs. For a node pair (i, j), the attribute entries correspond to the Euclidean distance, Manhattan distance and Angular distance between any two of the three vectors as well as the degrees of the nodes i and j. The class for a record is a 1 or 0, depending on whether the node pair is already connected with an edge or not. In order to predict whether two nodes could be linked or not, one can run the Random Forest algorithm (Ho 1995) on this supervised dataset.

In (Pecli et al. 2015), the authors proposed to form a link prediction dataset for node pairs based on the topological information about the nodes (primarily different centrality values of the nodes as well as shortest path length), reduce the dimensionality of the dataset (using techniques like PCA) and train traditional classifiers such as Naive-Bayes, k-NN and Support Vector Machines on the transformed dataset of fewer dimensions. Since the number of node pairs not connected by an edge typically outnumbers the number of node pairs connected by an edge, in Bao et al. (2013), the authors advocate the need to build a balanced training dataset that comprises a comparable number of records featuring node pairs that are connected by an edge and those that are not connected by an edge. The authors in Bao et al. (2013) observed that the use of PCA effectively balances the tradeoff between true and false positives even if the dataset has a wide range of link imbalance. In (Pech et al. 2017), the authors also observed that the use of PCA is very effective for networks of a wide range of density, especially those with large density. However, none of the above PCA-related works for link prediction advocated the computation of a comprehensive link prediction score that could be used to classify edges on the basis of their betweenness.

In (Brohl and Lehnertz 2022), the authors proposed the "nearest neighbor edge centrality (NNEC)" metric to quantitatively capture the extent to which an edge is located in the center of a network. Per (Brohl and Lehnertz 2022), the formula to compute the NNEC for an edge (u, v) in an undirected graph is (DEG[u] + DEG[v] − 2) / (|DEG[u] − DEG[v]| + 1), where DEG[u] and DEG[v] are the degrees of the nodes u and v. A closer look at this formula indicates the NNEC value for an edge (u, v) is likely to be larger if both the end vertices u and v of the edge are of larger degree and are comparable to each other. In a recent work (Meghanathan 2025), we observed a weak negative Spearman's rank-based correlation between the NNEC metric and the EBWC metric for a suite of 30 real-world networks (the median of the correlation coefficient values is around −0.20). Hence, the NNEC metric (though asynchronously computable and computationally-light, as its computation just requires the degrees of the end vertices of an edge) cannot be a viable candidate to develop an EBW-based binary edge classification model for real-world networks.
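The behavior of the NNEC formula can be checked with a few degree pairs. The sketch below is a minimal illustration for unweighted edges; the function name is hypothetical, and the denominator is assumed to use the absolute degree difference (so that the value is symmetric in u and v and the denominator never reaches zero).

```python
def nnec(deg_u, deg_v):
    """NNEC of an unweighted edge (u, v) from the degrees of its end vertices.

    Assumption: the denominator uses the absolute degree difference.
    """
    return (deg_u + deg_v - 2) / (abs(deg_u - deg_v) + 1)

both_large_and_equal = nnec(5, 5)   # two hubs of comparable degree
hub_to_leaf = nnec(5, 1)            # one hub, one degree-1 vertex
both_small_and_equal = nnec(2, 2)   # two low-degree vertices
```

Two degree-5 end vertices give 8.0, a degree-5 and a degree-1 end vertex give 0.8, and two degree-2 end vertices give 2.0, consistent with the observation that the value grows when both degrees are large and comparable.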

The work presented in Long and Liu (2019) could be considered as an edge classification model with respect to the notion of "edge intensity". The intensity of an edge (v_i, v_j) is computed as a weighted average of the number of edge-distinct paths of different lengths (1, 2, 3, …, a maximum P) between the two vertices v_i and v_j. Per (Long and Liu 2019), a threshold parameter r (with values of 0.49 or 0.51) is used to classify edges as skeleton edges (those with edge intensity greater than or equal to the threshold r) or non-skeleton edges (those with edge intensity less than the threshold r). The network could then be partitioned into disjoint communities by retaining only the skeleton edges. Though the non-skeleton edges (of lower intensity) could potentially be connecting two different communities and have the potential to incur a larger EBW compared to the skeleton edges (of larger intensity) that are more likely to be connecting vertices within a community, we claim the above intensity-based edge classification model cannot be used for EBW classification for the following reasons: (i) The intensity of an edge (especially that of a non-skeleton edge) is not a reflection of the number of vertices within the two communities that it connects. (ii) The intensity value for an edge is highly dependent on the parameter P (the maximum hop count for the edge-distinct paths) and the weights used for the edge-distinct paths of different lengths. (iii) For a given P, the intensity value for the non-skeleton edges could be the same, irrespective of the sizes of the two communities they connect. (iv) There is no theoretical or empirical formulation proposed in Long and Liu (2019) to choose the threshold parameter r (which is actually mentioned as a plan for future work in Long and Liu (2019)), and the value used for r would highly impact the classification of edges as skeleton and non-skeleton edges.

The node betweenness centrality (NBWC) quantifies the presence of nodes in the shortest paths between any two nodes in the network. Like EBWC, NBWC is also a computationally-intensive metric that needs to be computed in a synchronous fashion involving global knowledge of the network. However, unlike EBWC, there exist a few locally computable, asynchronous, computationally-light metrics that leverage the notions of local clustering coefficient and degree centrality (like the local cluster coefficient complement-based degree centrality: LCC'DC (Meghanathan 2017), the neighborhood node propagation entropy centrality: NNPEC (Chakravarthy and Selvaraj 2023), etc.) that could serve as viable alternatives for NBWC and could be used to rank the nodes in a network in lieu of NBWC. There are no such locally computable asynchronous metrics that could serve as a viable alternative for ranking the edges in a network in lieu of the EBWC. Also, as one of the end vertices of a high-EBWC edge could be of a lower NBWC (for example: in Fig. 7, the EBWC of edges (5, 9) and (5, 10) is 8.0, where 12.33 is the maximum EBWC for any edge in the graph; on the other hand, the NBWC of both the vertices 9 and 10 is 0.0), any regression model built on the basis of a dataset of the values of the computationally-light NBWC alternatives (such as LCC'DC, NNPEC, etc.) for the end vertices of the edges may not yield a single composite metric that could serve as a viable representative for the EBWC values of the edges.

Another shortest paths-based node centrality metric that is widely used along with NBWC is the node closeness centrality (NCLC), measured as the inverse of the sum of the shortest path distances of the node to every other node in the network. Unlike NBWC, NCLC can be computed for just a single node of interest (by running the BFS algorithm starting from the node), but with the global knowledge of the entire network. Nevertheless, if we are to rank the nodes in a network on the basis of the NCLC metric, we need to run the BFS algorithm n times, starting from each of the n nodes in the network. In (Eppstein and Wang 2004), the authors propose an approximation algorithm for NCLC, wherein a subset of l nodes (l << n) are randomly chosen in the network and the BFS algorithm is run starting from each of these l nodes; the NCLC of any node in the network is then computed as the inverse of the sum of the shortest path distances of the node to the l nodes. Various modifications, extensions and improvements to the above approximation algorithm are available in the literature (e.g., Saxena et al. 2019). Though both NBWC and NCLC are shortest paths-based centrality metrics, they exhibit only a moderate correlation (Meghanathan 2015) for several real-world networks; this could be attributed to the fundamental difference in their formulation (the NBWC of a node is a measure of the betweenness of the node in the shortest paths across the network; whereas, the NCLC of a node is a measure of the path lengths originating from the node of interest). Moreover, a recent work (Evans and Chen 2022) observed that the sum of the shortest path distances of a node is linearly dependent on the logarithm of the node degree, rendering the computation of NCLC to be redundant and also categorizing NCLC to be more of a neighborhood-based centrality metric. Hence, none of these NCLC approximation studies could be extended to develop a binary classification model for NBWC and EBWC.
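The sampling idea behind the NCLC approximation can be sketched as follows. This is a simplified illustration of the description given above and not the exact estimator of Eppstein and Wang (2004), which may apply additional scaling; the function names and the path graph are hypothetical, the graph is assumed to be connected and undirected, and the pivots are passed explicitly here (in practice they would be chosen at random).

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def approx_closeness(adj, pivots):
    """Inverse of the summed hop distances from each node to the chosen pivots.

    One BFS is run per pivot; in an undirected graph the distance from a
    pivot to a node equals the distance from the node to the pivot.
    """
    totals = {node: 0 for node in adj}
    for p in pivots:
        for node, d in bfs_distances(adj, p).items():
            totals[node] += d
    return {n: (1.0 / t if t > 0 else float("inf")) for n, t in totals.items()}

# Hypothetical path graph 0-1-2-3-4.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
exact = approx_closeness(PATH, pivots=list(PATH))  # l = n: one BFS per node
sampled = approx_closeness(PATH, pivots=[1, 3])    # l = 2: only two BFS runs
```

With all five nodes as pivots, the middle node 2 obtains the largest value (1/6), and with only two pivots it still scores higher than the end nodes, at the cost of two BFS runs instead of five.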

5 Conclusions, limitations and future work

The high-level contribution of this paper is a binary classification model to classify edges as those of high-betweenness or low-betweenness using principal component analysis (PCA) of link prediction scores. We exploit the tendency of link prediction techniques (especially the widely used computationally-light, local knowledge, common neighborhood-based link prediction techniques) to assign higher scores to node pairs (including edges that already exist in the network) that are within a dense sub graph (like a community) than to node pairs that are not part of any dense sub graph. Edges within a dense sub graph typically have a lower betweenness (edge betweenness centrality: EBWC) compared to edges that connect two different dense sub graphs but are not themselves part of any dense sub graph. We hence propose to conduct PCA on the link prediction scores dataset generated through four common neighborhood-based techniques and compute PCs-based weighted average link prediction scores $LP_{Avg}^{PC}$ whose sign could be used to classify edges as those of high-betweenness (if the $LP_{Avg}^{PC}$ scores are negative) or low-betweenness (if the $LP_{Avg}^{PC}$ scores are positive).
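As a rough illustration of the scoring step summarized above, the following Python sketch computes a PCs-based weighted average score for a handful of node pairs and reads off the class from its sign. Several details are assumptions and are not taken from this section: the four score columns are standardized before PCA; each principal component is oriented so that the sum of its loadings is non-negative (the sign of an eigenvector is otherwise arbitrary); a score of exactly zero is treated as low-betweenness; the score values in `toy_scores` are hypothetical; and the helper names (`jacobi_eigen`, `pc_weighted_scores`, `classify`) are illustrative only.

```python
import math

def jacobi_eigen(sym, sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a small symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and of v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v

def pc_weighted_scores(rows):
    """Variance-weighted average of the PC scores of each row (node pair)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(m)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]  # assumption: standardize
    cov = [[sum(zr[i] * zr[j] for zr in z) / (n - 1) for j in range(m)] for i in range(m)]
    eigvals, eigvecs = jacobi_eigen(cov)
    eigvals = [max(ev, 0.0) for ev in eigvals]  # guard against tiny negative round-off
    total = sum(eigvals)
    scores = [0.0] * n
    for k in range(m):
        vec = [eigvecs[i][k] for i in range(m)]
        if sum(vec) < 0:  # assumption: orient each PC so its loadings sum is non-negative
            vec = [-x for x in vec]
        weight = eigvals[k] / total  # variance of the PC as its weight
        for r in range(n):
            scores[r] += weight * sum(z[r][i] * vec[i] for i in range(m))
    return scores

def classify(score):
    """Negative score: high-betweenness class; otherwise low (zero is an assumption)."""
    return "high" if score < 0 else "low"

# Hypothetical scores from four neighborhood-based techniques (one row per edge).
toy_scores = [
    [4, 0.80, 2.0, 0.90],  # edges inside a dense sub graph
    [3, 0.60, 1.6, 0.75],
    [3, 0.70, 1.5, 0.80],
    [0, 0.00, 0.0, 0.00],  # edges bridging two sub graphs
    [1, 0.10, 0.4, 0.20],
    [0, 0.00, 0.0, 0.00],
]
for row, score in zip(toy_scores, pc_weighted_scores(toy_scores)):
    print(row, round(score, 3), classify(score))
```

Because the PC scores of centered data average to zero over the dataset, the weighted average also averages to zero, which is why its sign acts as a natural threshold between the two classes in this sketch.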

We validate our hypothesis on a suite of 40 real-world network graphs and evaluate the accuracy/effectiveness of our EBW binary classification model using five different performance metrics. For all the 40 real-world networks, we observe the average EBWC value of the edges predicted to be in the high-EBW class to be appreciably larger than the average EBWC value of the edges predicted to be in the low-EBW class. The lower the fraction of edges in the high-EBW class, the larger the average EBWC ratio, and vice-versa. We also measure the EBWC class concordance index to quantify the extent to which any edge predicted to be in the high-EBW class would indeed incur a larger EBWC than any edge predicted to be in the low-EBW class. We observe the EBWC class concordance index for the 40 real-world networks to range from 0.68 to 0.98, with a median of 0.85, irrespective of the fractions of edges in the two classes. Such high values for this metric corroborate our assertion that any edge predicted to be in the high-EBW class is more likely to have actually incurred a larger EBWC value than any edge predicted to be in the low-EBW class. We also observe the discordant probability for the 40 real-world networks to range from 0.64 to 0.92, with a median of 0.72. This implies that for any two edges e₁ and e₂ in any of the 40 real-world networks, there is at least a 0.64 (almost 2/3) chance that if $LP_{Avg}^{PC}$(e₁) < $LP_{Avg}^{PC}$(e₂), then EBWC(e₁) ≥ EBWC(e₂), and vice-versa. The local relative discordance between the $LP_{Avg}^{PC}$ scores and the actual EBWC values of the edges in the real-world networks is also reflected at the global level: we observe a moderate to very strong negative rank-based correlation. The Spearman's rank-based correlation coefficient between the two measures for the real-world networks ranges from −0.96 to −0.40, with a median of −0.62.
We also show that the $LP_{Avg}^{PC}$ scores for the node pairs that are not yet connected by an edge in the network could be used to select the node pair whose connection (in the form of an edge) would increase or at least sustain the modularity of the communities as well as to select the node pair(s) whose connection(s) would reduce the average shortest path length of the network.
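To make the evaluation metrics discussed above more concrete, the following Python sketch computes four of them on a hypothetical five-edge example. The formulas are our reading of the verbal descriptions in this section and are assumptions rather than the exact definitions used in the experiments: edges with a negative score form the high-EBW class (a zero score is placed in the low-EBW class); the class concordance index counts (high, low) edge pairs in which the high-class edge has the strictly larger EBWC; the discordant probability is taken over edge pairs with distinct scores; and ties in the Spearman computation receive average ranks. All names and data values are illustrative.

```python
from itertools import combinations

def split_classes(lp):
    """Edges with negative score: high-EBW class; the rest: low-EBW class."""
    high = [e for e in lp if lp[e] < 0]
    low = [e for e in lp if lp[e] >= 0]
    return high, low

def average_ebwc_ratio(lp, ebwc):
    high, low = split_classes(lp)
    avg_high = sum(ebwc[e] for e in high) / len(high)
    avg_low = sum(ebwc[e] for e in low) / len(low)
    return avg_high / avg_low

def class_concordance_index(lp, ebwc):
    """Fraction of (high, low) pairs where the high-class edge has larger EBWC."""
    high, low = split_classes(lp)
    hits = sum(1 for h in high for l in low if ebwc[h] > ebwc[l])
    return hits / (len(high) * len(low))

def discordant_probability(lp, ebwc):
    """Fraction of pairs where the lower-scored edge has EBWC >= the other edge."""
    discordant = comparable = 0
    for e1, e2 in combinations(list(lp), 2):
        if lp[e1] == lp[e2]:
            continue  # pairs with tied scores are skipped (assumption)
        comparable += 1
        lo, hi = (e1, e2) if lp[e1] < lp[e2] else (e2, e1)
        if ebwc[lo] >= ebwc[hi]:
            discordant += 1
    return discordant / comparable

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical weighted average scores and EBWC values for five edges.
lp = {"e1": -1.2, "e2": -0.4, "e3": 0.3, "e4": 0.9, "e5": 1.5}
ebwc = {"e1": 9.0, "e2": 6.0, "e3": 3.0, "e4": 4.0, "e5": 1.0}
edges = list(lp)

print(average_ebwc_ratio(lp, ebwc))
print(class_concordance_index(lp, ebwc))
print(discordant_probability(lp, ebwc))
print(spearman([lp[e] for e in edges], [ebwc[e] for e in edges]))
```

In this toy example the single inversion (edges e3 and e4) lowers the discordant probability and the magnitude of the rank correlation, but it does not affect the class concordance index because both edges fall in the same predicted class.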

We observe Spearman's rank-based correlation coefficient values in the range of −0.50 to −0.40 for real-world networks that tend to exhibit a hub-and-spoke configuration (like the airport networks #38 and #39 in Fig. 10). This could be due to the presence of a larger number of edges with NOVER scores of 0.0 that are connected to a hub/core node of larger degree at one end and to a peripheral node of lower degree at the other end. The local neighborhood-based link prediction techniques could come up with a lower link prediction score for such node pairs (due to their NOVER score being 0.0 or low and one of the end vertices having a lower degree); such edges tend to get classified in the high-EBW class per our proposed model, but their EBWC values may not be as high as the EBWC values of the edges between two hub/core nodes. We see this as a potential limitation on the use of the local link prediction techniques alone for predicting the EBW class of edges. In a recent work (Meghanathan 2024), we proposed an LCC (local clustering coefficient, a locally computable metric that does not need global knowledge)-based iterative peeling strategy to extract one or more layers of peripheral nodes around a central/inner core. We expect that augmenting the underlying local link prediction scores-based modeling dataset with the core/peripheral classification status of the end vertices of the node pairs (a categorical feature that could be incorporated as a numerical feature using data encoding techniques such as one-hot encoding) could improve the EBW classification accuracy of our model. We plan to explore this idea as part of future work.
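The hub-and-spoke limitation discussed above can be reproduced on a small hypothetical network with two connected hubs, each having three spokes. The sketch below assumes the commonly used definition of neighborhood overlap (number of common neighbors divided by the number of nodes adjacent to at least one end vertex, excluding the end vertices themselves) and computes EBWC with a standard BFS-based dependency accumulation for unweighted, undirected graphs; the function names and the network are illustrative only.

```python
from collections import deque

def build_adjacency(edge_list):
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def neighborhood_overlap(adj, u, v):
    """NOVER of edge (u, v); defined as 0.0 when no other neighbor exists."""
    common = len(adj[u] & adj[v])
    others = len(adj[u] | adj[v]) - 2  # exclude u and v themselves
    return common / others if others > 0 else 0.0

def edge_betweenness(adj):
    """Unnormalized EBWC of every edge (each node pair counted once)."""
    ebwc = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # farthest nodes first
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                ebwc[frozenset((u, w))] += share
                delta[u] += share
    return {e: val / 2.0 for e, val in ebwc.items()}  # undirected: halve

# Hypothetical hub-and-spoke network: hubs H1 and H2, three spokes each.
edge_list = [("H1", "H2"), ("H1", "a"), ("H1", "b"), ("H1", "c"),
             ("H2", "d"), ("H2", "e"), ("H2", "f")]
adj = build_adjacency(edge_list)
nover = {frozenset(e): neighborhood_overlap(adj, *e) for e in edge_list}
ebwc = edge_betweenness(adj)
for e in edge_list:
    print(e, nover[frozenset(e)], ebwc[frozenset(e)])
```

Every edge in this network has a NOVER of 0.0, so a purely neighborhood-based score cannot separate the hub-to-hub edge (EBWC of 16.0) from the hub-to-spoke edges (EBWC of 7.0 each), which is the situation that the proposed core/peripheral feature is meant to address.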

As part of future work, we also plan to explore the utility of node pairs whose $LP_{Avg}^{PC}$ scores are in the vicinity of 0.0 for triadic closure (Meghanathan 2023b); the inclusion of edges between such node pairs could increase the clustering coefficient of the network and simultaneously reduce the average shortest path length, which could be a potential strategy for transforming a given network into a small-world network. In addition, we plan to explore the correlation between the $LP_{Avg}^{PC}$ scores of the node pairs/edges and the node betweenness centrality (NBWC) values as well as the local clustering coefficient-based degree centrality metric (LCC'DC (Meghanathan 2017)) values of the corresponding end vertices, to see whether any inferences can be drawn about the NBWC values (without actually computing them) using the LCC'DC and $LP_{Avg}^{PC}$ measures.

Data availability

No datasets were generated or analysed during the current study.

References

Adamic LA, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3):211–230
Ahmad I, Akhtar MU, Shahnaz A (2020) Missing link prediction using common neighbor and centrality based parameterized algorithm. Sci Rep 10(364):1–9
Albert R, Barabasi A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(47):47–97
Ayoub J, Lotfi D, Hammouch A (2023) Link prediction using betweenness centrality and graph neural networks. Soc Netw Anal Min 13(5)
Bao Z, Zeng Y, Tay YC (2013) sonLP: Social network link prediction by principal component regression. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp 364–371
Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Bojanowski M, Chrol B (2020) Proximity-based methods for link prediction in graphs with R Package. Res Methods 29(1):5–28
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177
Brohl T, Lehnertz K (2022) A straightforward edge centrality concept derived from generalizing degree and strength. Sci Rep 12(4407):1–12
Chakravarthy TS, Selvaraj L (2023) NNPEC: Neighborhood node propagation entropy centrality is a unique way to find the influential node in a complex network. Concurr Comput Pract Exp 35(2):e7685
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge
Eppstein D, Wang J (2004) Fast approximation of centrality. J Graph Algorithms Appl 8(1):39–45
Erdos P, Renyi A (1959) On random graphs I. Publ Math 6(3–4):290–297
Evans TS, Chen B (2022) Linking the network centrality measures closeness and degree. Commun Phys 5(172):1–11
Fiedler M (1973) Algebraic connectivity of graphs. Czechoslov Math J 23(98):298–305
Freeman L (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41
Gao F, Musial K, Cooper C, Tsoka S (2015) Link prediction methods and their accuracy for different social networks and network metrics. Scientific Programming 2015, article 172879, pp 1–13
Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99:7821–7826
Gui C (2024) Link prediction based on spectral analysis. PLoS ONE 19(1):e0287385
He X, Ghasemian A, Lee E, Clauset A, Mucha PJ (2024) Sequential stacking link prediction algorithms for temporal networks. Nature Commun 15(1364):1–15
Ho TK (1995) Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, pp 278–282
Jaccard P (1912) The distribution of the flora in the alpine zone. New Phytol 11(2):37–50
Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer
Kerrache S, Alharbi R, Benhidour H (2020) A scalable similarity-popularity link prediction method. Sci Rep 10(6394):1–14
Kumar A, Singh SS, Singh K, Biswas B (2020) Link prediction techniques, applications, and performance: A survey. Physica A 553:124289
Kumari A, Behera RK, Sahoo B, Sahoo SP (2022) Prediction of link evolution using community detection in social network. Computing 104(5):1077–1098
Long H, Liu X-W (2019) A unified community detection algorithm in large-scale complex networks. Adv Complex Syst 22(3):1950004
Meghanathan N (2014) Spectral radius as a measure of variation in node degree for complex network graphs. In: Proceedings of the 3rd International Conference on Digital Contents and Applications, Hainan, China, December 20–23, 2014, pp 30–33
Meghanathan N (2015) Correlation coefficient analysis of centrality metrics for complex network graphs. In: Proceedings of the 4th Computer Science Online Conference, Springer Intelligent Systems in Cybernetics and Automation Theory: Advances in Intelligent Systems and Computing, vol 348, pp 11–20
Meghanathan N (2017) Randomness Index for complex network analysis. Soc Netw Anal Min 7(25):1–15
Meghanathan N (2017) A computationally-lightweight and localized centrality metric in lieu of betweenness centrality for complex network analysis. Vietnam J Comput Sci 4(1):23–38
Meghanathan N (2023) A neighborhood overlap-based binary search algorithm for edge classification to satisfy the strong triadic closure property in complex networks. In: Proceedings of the CSOC 2023 Online Conference, Springer Artificial Intelligence Application in Networks and Systems, pp 160–169
Meghanathan N (2023) Core-intermediate-peripheral index: factor analysis of neighborhood and shortest paths-based centrality metrics. In: Proceedings of the 7th Computational Methods in Systems and Software, Springer Lecture Notes in Networks and Systems, vol 910, pp 363–372
Meghanathan N (2024) Local clustering coefficient-based iterative peeling strategy to extract the core and peripheral layers of a network. Appl Netw Sci 9(49):1–26
Meghanathan N (2025) Betweenness-based ranking of edges using the principal components of the complements of local clustering coefficient and neighborhood overlap. Comput Inf Sci 18(1):1–16
Meghanathan N, Yang F (2019) Correlation analysis: Edge betweenness centrality vs. neighborhood overlap. Int J Netw Sci 1(4):299–324
Naik D, Ramesh D, Gorojanam NB (2023) Enhanced Link prediction using sentiment attribute and community detection. J Ambient Intell Humaniz Comput 14:4157–4174
Nandini YV, Jaya Lakshmi T, Enduri MK, Sharma H (2024) Link prediction in complex networks using average centrality-based similarity score. Entropy 26(6), article 433, pp 1–19
Newman MEJ (2010) Networks: an introduction, 1st edn. Oxford University Press, Oxford
Pech R, Hao D, Pan L, Cheng H, Zhou T (2017) Link Prediction via matrix completion. Europhys Lett 117(3):38002
Pecli A, Giovanini B, Pacheco C, Moreira C, Ferreira F, Tosta F, Tesolin J, Vinicius Dias M, Filho S, Claudia Cavalcanti M, Goldschmidt R (2015) Dimensionality reduction for supervised learning in link prediction problems. In: Proceedings of the 17th International Conference on Enterprise Information Systems, pp 295–302
Salton G (1983) Introduction to modern information retrieval, 1st edn. McGraw-Hill College
Saxena A, Gera R, Iyengar SRS (2019) A heuristic approach to estimate nodes' closeness rank using the properties of real world networks. Soc Netw Anal Min 9(3)
Saxena R, Patil SP, Verma AK, Jadeja M, Vyas P, Bhateja V, Lin JC-W (2023) An efficient bet-GCN approach for link prediction. Int J Interact Multimedia Artif Intell 8(1):38–52
Singh SS, Mishra S, Kumar A, Biswas B (2020) CLP-ID: community-based link prediction using information diffusion. Inf Sci 514:402–433
Strang G (2006) Linear algebra and its applications, 4th edn. Brooks Cole, Pacific Grove
Sulaimany S, Amini Y (2021) Bipartite link prediction improvement using the effective utilization of edge betweenness centrality. In: Proceedings of the 11th International Conference on Computer Engineering and Knowledge, Mashhad, Iran, pp 200–205
Tian M, Moriano P (2023) Robustness of community structure under edge addition. Phys Rev E 108(054302):1–24
Yu J, Wu L-Y (2022) Understanding the network formation pattern for better link prediction. Physica A 600:127522
Zhang Z, Wu X, Xu H, Cui L, Zhang H, Qin W (2023) Link value estimation based graph attention network for link prediction in complex networks. In: Proceedings of the 9th International Symposium on System Security, Safety, and Reliability, Hangzhou, China, pp 189–196

Author information

Authors and Affiliations

Jackson State University, Jackson, MS, 39217, USA
Natarajan Meghanathan

Contributions

I am the only author for this paper. I did all the work to develop the model, obtain the results and write the paper.

Corresponding author

Correspondence to Natarajan Meghanathan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Meghanathan, N. Principal component analysis of link prediction scores to propose a binary classification model for the betweenness of edges in complex networks. Soc. Netw. Anal. Min. 15, 8 (2025). https://doi.org/10.1007/s13278-025-01438-7

Received: 08 July 2024
Revised: 23 December 2024
Accepted: 25 December 2024
Published: 10 March 2025
Version of record: 10 March 2025
DOI: https://doi.org/10.1007/s13278-025-01438-7

1 Introduction