```bibtex
@article{aiyappa2024implicit,
  title={Implicit degree bias in the link prediction task},
  author={Rachith Aiyappa and Xin Wang and Munjung Kim and Ozgur Can Seckin and Jisung Yoon and Yong-Yeol Ahn and Sadamori Kojaku},
  journal={arXiv preprint arXiv:2405.14985},
  year={2024}
}
```
- Degree-corrected link prediction task
- Running your link prediction benchmarks
- Reproducing the results
This repository provides the code to generate the degree-corrected link prediction task.
```shell
pip install "git+https://[email protected]/skojaku/degree-corrected-link-prediction.git#subdirectory=libs/dclinkpred&egg=dclinkpred"
```

or
```shell
git clone https://github.com/skojaku/degree-corrected-link-prediction.git
cd degree-corrected-link-prediction/libs/dclinkpred
pip install -e .
```

Once installed, the package can be used as follows:

```python
from dclinkpred import LinkPredictionDataset

import networkx as nx

# Create the karate club graph
G = nx.karate_club_graph()

# While the graph can be a networkx object, the adjacency matrix is recommended for efficiency
G = nx.adjacency_matrix(G)

lpdata = LinkPredictionDataset(
    testEdgeFraction=0.2,  # 20% of the edges will be used for testing
    degree_correction=True,  # degree correction will be applied
    negatives_per_positive=10,  # 10 negative samples will be generated for each positive sample
    allow_duplicatd_negatives=False,  # do not allow duplicate negative edges
)
lpdata.fit(G)  # Fit the dataset

train_net, src_test, trg_test, y_test = lpdata.transform()  # Generate the train/test data

train_net  # the network for training
src_test   # the source nodes of the test edges
trg_test   # the destination nodes of the test edges
y_test     # the labels of the test edges, where 1 means positive and 0 means negative
```

We provide all source code and data to reproduce the results in the paper. We tested the workflow under the following environment.
- OS: Ubuntu 20.04
- CUDA: 12.1
- Python: 3.11
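The test split returned by `lpdata.transform()` above can be scored with any link predictor. Below is a minimal sketch using small synthetic stand-in arrays (not real outputs of the package): a degree-product (preferential attachment) scorer and a hand-rolled AUC, both purely illustrative and not part of `dclinkpred` or the paper's evaluation code.

```python
import numpy as np

# Stand-ins for the outputs of lpdata.transform(); in practice, use the real
# train_net (a sparse adjacency matrix) and test arrays. Here a tiny dense
# adjacency matrix suffices for illustration.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
deg = A.sum(axis=1)                # node degrees in the training network

src_test = np.array([0, 1, 0, 3])  # source nodes of the test pairs
trg_test = np.array([3, 3, 2, 1])  # target nodes of the test pairs
y_test = np.array([0, 0, 1, 0])    # 1 = positive edge, 0 = sampled negative

# Preferential-attachment score: product of the endpoint degrees
scores = deg[src_test] * deg[trg_test]

# AUC: probability that a random positive outranks a random negative (ties count 1/2)
pos, neg = scores[y_test == 1], scores[y_test == 0]
auc = (
    (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
) / (len(pos) * len(neg))
print(auc)  # → 1.0 on this toy split
```

With degree correction enabled, negatives are sampled to match the degree distribution of positives, so a pure degree-based scorer like this should score close to chance rather than artificially high.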
All code is provided in the reproduction/ directory. The expected execution time varies with the available computational resources. On our machine, equipped with 8 NVIDIA V100 GPUs and 64 CPUs, the entire workflow, including the robustness analysis, takes approximately one week.
We provide the network data in edge list format on FigShare.
The edge list is a CSV file with 2 columns representing the source and destination nodes of the network.
Download the data and place it in the reproduction/data/raw directory.
We recommend using mamba (via Miniforge) to manage the packages.
Specifically, we build the conda environment with the following commands:
```shell
mamba create -n linkpred -c bioconda -c nvidia -c pytorch -c pyg python=3.11 cuda-version=12.1 pytorch torchvision torchaudio pytorch-cuda=12.1 snakemake graph-tool scikit-learn numpy==1.23.5 numba scipy==1.10.1 pandas polars networkx seaborn matplotlib gensim ipykernel tqdm black faiss-gpu pyg pytorch-sparse python-igraph -y
pip install adabelief-pytorch==0.2.0
pip install GPUtil powerlaw
```

You can also use the environment.yml file to create the conda environment:

```shell
mamba env create -f environment.yml
```

Additionally, we need the following custom packages to run the experiments.
- gnn_tools provides the code for generating graph embeddings using GNNs. We used version 1.0.
- embcom provides supplementary graph embedding methods. We used version 1.01.
- LFR-benchmark provides the code for the LFR benchmark. We used version 1.01.
These packages can be installed via pip as follows:
```shell
pip install git+https://github.com/skojaku/[email protected]
pip install git+https://github.com/skojaku/[email protected]
```

And to install the LFR benchmark package:

```shell
git clone https://github.com/skojaku/LFR-benchmark
cd LFR-benchmark
python setup.py build
pip install -e .
```

We provide a Snakemake file to run the experiments. Before running Snakemake, you must create a config.yaml file under the reproduction/workflow/ directory:
```yaml
data_dir: "data/"
small_networks: False
```

where `data_dir` is the directory in which all data is located, and `small_networks` is a boolean indicating whether to run the experiments on small networks for testing the code.
Once you have created the config.yaml file, move to the reproduction/ directory and run Snakemake as follows:

```shell
snakemake --cores <number of cores> all
```

or, conveniently,

```shell
nohup snakemake --cores <number of cores> all > log &
```

Snakemake will preprocess the data, run the experiments, and generate the figures in the reproduction/figs/ directory.
New networks can be added to the experiment by adding a new file to the reproduction/data/raw directory.
The file should be in the edge list format with 2 columns representing the source and destination nodes of the network, e.g.,
```
1 2
1 3
1 4
```
where each row forms an edge between the source and destination nodes, and the node IDs should start from 1.
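To prepare such a file from an existing edge list, a short script can remap node IDs so they start from 1 and write the two space-separated columns shown above. This is a minimal sketch: the toy edges and the output filename `my_network.csv` are illustrative, not part of the repository.

```python
import csv

# Toy 0-indexed input edges; replace with your own network's edge list
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Remap node IDs to consecutive integers starting from 1, as required
nodes = sorted({n for e in edges for n in e})
node_id = {v: i + 1 for i, v in enumerate(nodes)}

# Write two space-separated columns (source, destination), one edge per row
with open("my_network.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=" ")
    for u, v in edges:
        writer.writerow([node_id[u], node_id[v]])
```

The resulting file can then be dropped into reproduction/data/raw and picked up by the workflow.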