
Matrix Factorization for Inferring Associations and Missing Links

arXiv:2503.04680v1 [cs.LG] 6 Mar 2025

RYAN BARRON∗ , Theoretical Division, Los Alamos National Laboratory, USA


MAKSIM E. EREN∗ , Information Systems and Modeling, Los Alamos National Laboratory, USA
DUC P. TRUONG∗ , Theoretical Division, Los Alamos National Laboratory, USA
CYNTHIA MATUSZEK, Department of Computer Science and Electrical Engineering, University of Maryland,
Baltimore County, USA
JAMES WENDELBERGER, Computer, Computational, and Statistical Sciences, Los Alamos National Laboratory, USA
MARY F. DORN, Computer, Computational, and Statistical Sciences, Los Alamos National Laboratory, USA
BOIAN ALEXANDROV, Theoretical Division, Los Alamos National Laboratory, USA
Missing link prediction is a method for network analysis, with applications in recommender systems, biology, social sciences,
cybersecurity, information retrieval, and Artificial Intelligence (AI) reasoning in Knowledge Graphs. Missing link prediction identifies
unseen but potentially existing connections in a network by analyzing the observed patterns and relationships. In proliferation
detection, this supports efforts to identify and characterize attempts by state and non-state actors to acquire nuclear weapons or
associated technology - a notoriously challenging but vital mission for global security. Dimensionality reduction techniques like
Non-Negative Matrix Factorization (NMF) and Logistic Matrix Factorization (LMF) are effective but require selection of the matrix
rank parameter, that is, the number of hidden features, 𝑘, to avoid over- or under-fitting. We introduce novel Weighted (WNMFk),
Boolean (BNMFk), and Recommender (RNMFk) matrix factorization methods, along with ensemble variants incorporating logistic
factorization, for link prediction. Our methods integrate automatic model determination for rank estimation by evaluating stability
and accuracy using a modified bootstrap methodology and uncertainty quantification (UQ), assessing prediction reliability under
random perturbations. Additionally, we incorporate Otsu threshold selection and k-means clustering for Boolean matrix factorization,
comparing them to coordinate descent-based Boolean thresholding. Our experiments highlight the impact of rank 𝑘 selection, evaluate
model performance under varying test-set sizes, and demonstrate the benefits of UQ for reliable predictions using abstention. We
validate our methods on three synthetic datasets (Boolean and Gaussian distributed) and benchmark them against LMF and symmetric
LMF (symLMF) on five real-world protein-protein interaction networks, showcasing an improved prediction performance.

CCS Concepts: • Computing methodologies → Dimensionality reduction and manifold learning; Non-negative matrix
factorization; Boosting.
∗ Authors contributed equally to this research.

Authors’ Contact Information: Ryan Barron, [email protected], Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA; Maksim E.
Eren, [email protected], Information Systems and Modeling, Los Alamos National Laboratory, Los Alamos, NM, USA; Duc P. Truong, [email protected],
Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM, USA; Cynthia Matuszek, [email protected], Department of Computer Science and
Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD, USA; James Wendelberger, [email protected], Computer, Computational,
and Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM, USA; Mary F. Dorn, [email protected], Computer, Computational, and
Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM, USA; Boian Alexandrov, [email protected], Theoretical Division, Los Alamos
National Laboratory, Los Alamos, NM, USA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Additional Key Words and Phrases: AI reasoning, missing links prediction, network analysis, matrix factorization, Boolean, data
completion

ACM Reference Format:


Ryan Barron, Maksim E. Eren, Duc P. Truong, Cynthia Matuszek, James Wendelberger, Mary F. Dorn, and Boian Alexandrov. 2018.
Matrix Factorization for Inferring Associations and Missing Links. Proc. ACM Meas. Anal. Comput. Syst. 37, 4, Article 111 (August 2018),
35 pages. https://doi.org/XXXXXXX.XXXXXXX

1 Introduction
Link prediction is a network analysis technique to infer missing or future connections in a network based on its current
structure and patterns in the known interactions. The task of predicting these absent but potentially existing links is
an important problem across various domains, including biological network analysis, where it helps uncover novel
protein-protein interactions (PPI) [60, 64], social network analysis for community detection [27], recommender systems
for personalized suggestions [23, 41, 72], cybersecurity for anomaly detection [3, 24], and explainable reasoning over
Knowledge Graphs [71].
Matrix factorization techniques, such as Non-negative Matrix Factorization (NMF) and Logistic Matrix Factorization
(LMF), have emerged as practical dimensionality reduction approaches for link prediction [43, 60]. By decomposing the
network of known interactions, represented by a matrix, into lower-dimensional components, these methods aim to
capture the network’s latent (or hidden) structural patterns, enabling the accurate prediction of missing links based on
the existing interactions. However, these models are sensitive to the choice of the rank parameter, 𝑘, which determines
the dimensionality of the reduced space. If 𝑘 is selected to be too small, distinct patterns are mixed together (under-fitting). If 𝑘 is chosen to be too large, the solution also captures noise (over-fitting), potentially reducing the predictive performance on the missing links.
To address this challenge, here we propose novel extensions of NMF-based approaches: Weighted (WNMFk), Boolean
(BNMFk), and Recommender (RNMFk) matrix factorization, along with ensemble variants incorporating logistic
factorization. These methods integrate automatic model determination and uncertainty quantification (UQ). Our
automatic model determination heuristically identifies the optimal rank, 𝑘, by evaluating solution stability and accuracy
[54]. At the same time, UQ leverages a modified bootstrap methodology to quantify prediction reliability under random
perturbations of the input matrix. Furthermore, we incorporate k-means clustering [36] and Otsu’s method [58] into our
framework for Boolean matrix factorization. These methods are employed to determine thresholds for obtaining Boolean
latent factors. A comparative analysis is performed for these thresholding approaches alongside the coordinate descent-
based thresholding technique, evaluating their relative effectiveness in Boolean matrix factorization. Additionally, we
compare the performance of Boolean matrix factorization with that of non-Boolean factorizations [18]. We assess
their effectiveness in predicting the correct matrix rank for both Boolean and non-Boolean cases and their accuracy in
identifying missing links.
In this paper, we evaluate the effectiveness of the proposed methods through experiments conducted on three
synthetic datasets and five real-world PPI networks [60]. The experiments on synthetic datasets emphasize the critical
role of rank selection and demonstrate that uncertainty-aware approaches can enhance model reliability by incorporating
a reject option, where the model abstains from making predictions when uncertainty is high. Additionally, we analyze
the impact of data sparsity by systematically increasing the test set size on the synthetic datasets. The proposed
methods are benchmarked against LMF [37] and symLMF [60] on the PPI networks, showcasing improvements in

prediction performance. Introduced methods are also compared against NMF with automatic model determination
(NMFk) [4, 5, 8, 34] on the synthetic datasets. Our contributions include:
• Introduce three new link prediction methods (WNMFk, BNMFk, and RNMFk), along with ensemble variants that
incorporate logistic factorization: WNMFklmf , BNMFklmf , and RNMFklmf .
• Demonstrate that adding logistic factorization as an ensemble component to WNMFk, BNMFk, and RNMFk
improves missing link prediction on PPI datasets.
• Adopt 𝑘-means clustering and Otsu-based thresholding for Boolean matrix factorization.
• Compare WNMFk, BNMFk, and RNMFk in Boolean and non-Boolean settings to evaluate their accuracy in
predicting the correct matrix rank and link prediction performance.
• Highlight the importance of choosing an appropriate rank to improve link prediction on Boolean and Gaussian
distributed synthetic datasets.
• Analyze the impact of data sparsity on link prediction performance.
• Introduce a UQ framework for our link prediction methods, offering a reject option for uncertain predictions.
We show how this option improves overall accuracy by reducing the coverage rate, the fraction of samples on which the model makes a decision (abstaining with an "I do not know" option on the rest).
• Present a user-friendly Python library, Tensor Extraction of Latent Features (T-ELF), that implements the proposed
methods with support for multi-processing, Graphics Processing Unit (GPU) acceleration, and High-Performance
Computing (HPC) environments to handle large-scale computations [21].¹
The remainder of the paper is structured as follows: Section 2 reviews related work. Section 3 introduces the
background on NMF, LMF, and NMFk, followed by details on the proposed methods (WNMFk, BNMFk, and RNMFk), the
LMF extension to the methods, Otsu thresholding, k-means clustering for Boolean settings, and UQ system integration.
Section 4 describes the datasets. The experimental setup, including cross-validation, train/test sampling, performance
metrics, and computational resources, is outlined in Section 5. Experiments on rank prediction, data sparsity, and
Boolean vs. non-Boolean factorization with synthetic datasets are in Sections 6.1, 6.2, and 6.3. Section 6.4 presents results
for LMF-extended WNMFklmf , BNMFklmf , and RNMFklmf , benchmarked against LMF and symLMF on real-world PPI
datasets. Section 7 briefly introduces the public Python library implementing these methods. Appendix A provides
additional results for a more comprehensive presentation.

2 Relevant Work
Matrix factorization (MF) techniques have been widely used for predicting missing links in networks by leveraging latent
structure and graph embedding methods to infer potential but previously unseen connections [15, 37, 44, 51, 55, 61, 73].
In the context of NMF, early research addressed incomplete observations in user-rating data [38], introducing weighted
NMF (WNMF) for collaborative filtering tasks. Further adaptations incorporated node attributes to tackle sparsity in link
prediction, such as GJSNMF [67] and JWNMF [66], while others focused on recommender systems using constraints [31]
or federated learning [23, 25]. Ensemble NMF approaches [20] split large graphs into subproblems for better scalability,
and other work introduced graph regularization or deep architectures [50, 74], often addressing extreme sparsity with
additional complexity.
LMF [37] offers a probabilistic alternative for modeling binary or implicit feedback, with extensions that incorporate
neighborhood structure [46, 47] or symmetry constraints [60] to capture local and global relationships in network
¹ T-ELF is available at https://github.com/lanl/T-ELF

data. Beyond real-valued methods, Boolean or binary factorizations are useful in logical structures, employing 0/1
constraints to represent observed links. Pioneering work on Boolean matrix factorization (BMF) investigated logical
and arithmetic operations [19]. In this work, we integrate Otsu’s method [58] and k-means clustering [36] into Boolean
factorization as thresholding mechanisms. Similarly, [9] employed Otsu’s method and k-means clustering to segment
brain diffusion imaging based on a reduced feature space obtained using NMF. While [9] applied these techniques
to classify the features of interest in the images such as white matter, gray matter, and cerebrospinal fluid using the
latent factors, our approach differs in that we integrate them directly into the factorization process to impose Boolean
constraints on the latent factors. Rank selection remains a critical challenge in these various MF formulations, with
recent attempts at cross-validation [42], bootstrap-based stability analyses [13], or other regularization approaches [35]
that studied balancing model complexity and predictive accuracy.
Besides missing link prediction in conventional networks, MF has significantly impacted AI reasoning within
Knowledge Graphs (KGs) [14, 71], representing relationships among entities as triple-based structures. Symbolic
reasoning methods in KGs often suffer from scalability issues and incomplete knowledge, leading to the adoption of
MF-based embedding techniques that uncover latent structures for knowledge base completion [56]. Decomposing
adjacency matrices corresponding to KG relations allows AI systems to predict new connections, improving coverage
and inferential power. These factorizations also benefit recommendation systems by fusing user-item interactions with
explicit semantic links from the KG, offering more explainable recommendations [71]. In the life sciences, large-scale
biological graphs have utilized MF to hypothesize previously unknown interactions among genes, proteins, or drugs
[76], highlighting the versatility of factorization-based approaches for reasoning-driven discovery.
Despite the progress of MF in missing link prediction and AI reasoning, two key challenges remain. First, UQ is
critical yet largely unaddressed. Most methods yield only point estimates and cannot measure how trustworthy each
prediction is, although such measures are vital in high-stakes domains like healthcare and finance [32, 46]. Second, the
problem of optimal rank selection persists, with underestimation or overestimation leading to suboptimal performance
[42, 53]. To address these gaps, our work proposes novel Weighted, Boolean, and Recommender NMFk factorization
methods, each incorporating automated rank determination and deterministic thresholding. We further enrich these
methods through ensemble strategies and logistic components, and we introduce a bootstrap-based UQ mechanism that
allows an abstention option for highly uncertain predictions. Combining thresholding, ensemble-driven rank selection,
and confidence scoring aims to improve the reliability, interpretability, and applicability of MF-based link prediction in
classic networks and knowledge-graph-driven AI systems.

3 Methods
In this section, we discuss the methodologies and techniques underlining our proposed approaches. First, we review the
NMF (Section 3.1) and LMF (Section 3.2) approaches, as well as the NMFk method (Section 3.3), which integrates an
automatic model determination strategy to NMF, forming the foundational background for our introduced methods.
Sections 3.4, 3.5, and 3.6 present the WNMFk, RNMFk, and BNMFk methods, respectively. We describe the ensemble
approaches with these methods by integrating LMF in Section 3.7. In Section 3.8, the Boolean perturbation technique is
introduced, and in Section 3.9, we discuss the Boolean clustering scheme employed in our methods when performing
decomposition under Boolean settings. Additionally, Section 3.10 introduces the adoption of Otsu thresholding and
k-means clustering for thresholding latent factors in Boolean decomposition, along with a description of the coordinate
descent-based approach. Note that the Boolean operations discussed in Sections 3.8, 3.9, and 3.10 are integrated into
BNMFk and are utilized to enable its functionality in Boolean settings. In our experiments, Section 6, these operations

Table 1. Summary of the notation styles used in the paper.

Notation    Description
𝑥           Scalar
x           Vector
X           Matrix
𝓧           Tensor
x𝑖          𝑖th element in the vector
X𝑖𝑗         Entry at row 𝑖, column 𝑗
X𝑖:         𝑖th row
X:𝑗         𝑗th column
X::𝑖        𝑖th slice (3rd dimension)
X:::𝑖       𝑖th slice (4th dimension)
X^name      Superscript identifier
∗           Dot product
⊙           Element-wise product
⊗𝐵          Boolean matrix multiplication

are also applied to WNMFk and RNMFk to evaluate their performance under Boolean conditions. Conversely, the
Boolean settings in BNMFk can be turned off to evaluate the method’s performance under non-Boolean optimization,
providing a comprehensive analysis of its adaptability across different data environments. Finally, we present the
integration of the UQ framework into our system in Section 3.11. For ease of reference, a summary of the notations
used throughout the paper is provided in Table 1.

3.1 Non-negative Matrix Factorization (NMF)


NMF is an unsupervised learning method based on low-rank approximation that reduces dimensionality. NMF approximates a given observation matrix X ∈ R+^{𝑛×𝑚} with non-negative entries as a product of two non-negative matrices, i.e., X ≈ WH and X̂ = WH, where W ∈ R+^{𝑛×𝑘} and H ∈ R+^{𝑘×𝑚}, and usually 𝑘 ≪ 𝑚, 𝑛. Here, 𝑛 is the number of features (rows), 𝑚 is the number of samples (columns), and the small dimension 𝑘 is the low-rank of the approximation. We perform this factorization via a non-convex minimization with a non-negativity constraint, utilizing the multiplicative updates algorithm [44] and the Frobenius norm as the distance metric, with the objective function:

min_{W ∈ R+^{𝑛×𝑘}, H ∈ R+^{𝑘×𝑚}}  ||X − WH||²_F ,    (1)

which allows NMF to be treated as a Gaussian mixture model [26]. In Equation 1, the factors W and H, the latent factors, are the solution of the optimization problem, estimated via alternating updates with the rules in Equations 2 and 3, respectively; the performance of the minimization is evaluated by the relative reconstruction error in Equation 4:

W = W ⊙ (XHᵀ) / (WHHᵀ),    (2)        H = H ⊙ (WᵀX) / (WᵀWH),    (3)        Relative Error = ||X − WH||²_F / ||X||²_F .    (4)
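To make the update rules concrete, the following NumPy sketch implements the multiplicative-updates iteration of Equations 2-4. The function name nmf_mu and the small constant eps that guards against division by zero are our own illustrative choices; this is a minimal sketch, not the T-ELF implementation.

import numpy as np

def nmf_mu(X, k, n_iter=500, eps=1e-9, seed=0):
    # Multiplicative-updates NMF (Equations 1-4); an illustrative sketch.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps  # non-negative random initialization
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)   # Equation 2
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)   # Equation 3
    rel_err = np.linalg.norm(X - W @ H) ** 2 / np.linalg.norm(X) ** 2  # Equation 4
    return W, H, rel_err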

3.2 Logistic Matrix Factorization (LMF)


LMF extends the principles of matrix factorization to binary data, where the observed matrix X ∈ {0, 1}𝑛×𝑚 represents
the presence or absence of interactions or a known link [37]. Unlike NMF, which minimizes reconstruction error using
the Frobenius norm, LMF incorporates the logistic regression model to capture the likelihood of binary observations. In
the binary interaction matrix X ∈ {0, 1}𝑛×𝑚 , each entry X𝑖 𝑗 represents the presence (𝑋𝑖 𝑗 = 1) or absence (𝑋𝑖 𝑗 = 0) of a
link between node 𝑖 and node 𝑗. However, in many real-world datasets, some interactions are unobserved or missing (i.e., the missing links). To account for this, a binary mask matrix M ∈ {0, 1}^{𝑛×𝑚} is defined, where M𝑖𝑗 = 1 if X𝑖𝑗 is observed (a known link) and M𝑖𝑗 = 0 for the missing links. For instance, in PPI networks, X𝑖𝑗 = 1 might indicate a known interaction between proteins 𝑖 and 𝑗, while X𝑖𝑗 = 0 represents the absence of an interaction. If the interaction status between proteins 𝑖 and 𝑗 has not been experimentally determined, M𝑖𝑗 = 0 is used to indicate the missing entry

at X𝑖 𝑗 . In LMF, X is approximated as:

X̂ = 𝜎 (WH + b𝑟 + b𝑐 ), (5)

where W ∈ R𝑛×𝑘 is the row latent feature matrix, H ∈ R𝑘 ×𝑚 is the column latent feature matrix, b𝑟 ∈ R𝑛×1 is the
row bias vector (broadcasted across columns), b𝑐 ∈ R1×𝑚 is the column bias vector (broadcasted across rows), and the
element-wise logistic sigmoid function:

𝜎(𝑥) = 1 / (1 + 𝑒^{−𝑥}).    (6)
Row and column biases are incorporated into the LMF model to account for systematic variations in the data. The
row bias b𝑟 ∈ R𝑛×1 captures row-specific effects, such as a node’s general tendency to form links across all columns.
Similarly, the column bias b𝑐 ∈ R1×𝑚 models column-specific effects, such as a particular node’s propensity to attract
connections. These biases are added to the matrix reconstruction, allowing the model to better fit the data by accounting
for individual node-specific tendencies. For example, in a PPI network, the row bias b𝑟 could represent the inherent
interaction likelihood of a specific protein across all other proteins. In contrast, the column bias b𝑐 adjusts for the
overall connectivity tendency of a target protein across all rows.
Each entry X̂𝑖𝑗 in the reconstructed matrix represents the predicted probability that a link exists between node 𝑖 and node 𝑗. The optimization objective in LMF minimizes the negative log-likelihood of the binary observations under the logistic model:
minimize_{W, H, b𝑟, b𝑐}  − Σ_{𝑖=1}^{𝑛} Σ_{𝑗=1}^{𝑚} M𝑖𝑗 [ X𝑖𝑗 log X̂𝑖𝑗 + (1 − X𝑖𝑗) log(1 − X̂𝑖𝑗) ] + 𝜆 ( ||W||²_F + ||H||²_F + ||b𝑟||²₂ + ||b𝑐||²₂ ),    (7)

where 𝜆 is the regularization parameter to prevent overfitting. The optimization problem is solved using gradient-based
methods. The gradients for each parameter are computed as follows:

∂𝐿/∂W = [M ⊙ (X̂ − X)] Hᵀ + 𝜆 W,    (8)

∂𝐿/∂H = Wᵀ [M ⊙ (X̂ − X)] + 𝜆 H,    (9)

∂𝐿/∂b𝑟 = Σ_{𝑗=1}^{𝑚} [M ⊙ (X̂ − X)]:𝑗 + 𝜆 b𝑟,    (10)

∂𝐿/∂b𝑐 = Σ_{𝑖=1}^{𝑛} [M ⊙ (X̂ − X)]𝑖: + 𝜆 b𝑐,    (11)

where

𝐿 = (1/2) ||M ⊙ (X̂ − X)||²_F + (𝜆/2) ( ||W||²_F + ||H||²_F + ||b𝑟||²₂ + ||b𝑐||²₂ ).    (12)
The parameters W, H, b𝑟 , and b𝑐 are updated iteratively using gradient descent:
W ← W − 𝜂 ∂𝐿/∂W,    H ← H − 𝜂 ∂𝐿/∂H,    b𝑟 ← b𝑟 − 𝜂 ∂𝐿/∂b𝑟,    b𝑐 ← b𝑐 − 𝜂 ∂𝐿/∂b𝑐,    (13)
where 𝜂 is the learning rate. LMF is particularly well-suited for sparse, binary datasets, as it naturally handles probabilistic
modeling of binary outcomes. Adding biases for rows and columns allows for improved flexibility in capturing systematic
variations in data.
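As a concrete illustration of Equations 5-13, the following is a minimal gradient-descent sketch of LMF. The names sigmoid and lmf, the fixed learning rate, and the fixed iteration count are our own illustrative assumptions standing in for whatever stopping rule an actual implementation would use.

import numpy as np

def sigmoid(z):
    # Element-wise logistic function (Equation 6).
    return 1.0 / (1.0 + np.exp(-z))

def lmf(X, M, k, lam=0.01, eta=0.01, n_iter=2000, seed=0):
    # Gradient-descent LMF (Equations 5-13); a minimal sketch.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = 0.1 * rng.standard_normal((n, k))
    H = 0.1 * rng.standard_normal((k, m))
    b_r = np.zeros((n, 1))  # row bias, broadcast across columns
    b_c = np.zeros((1, m))  # column bias, broadcast across rows
    for _ in range(n_iter):
        X_hat = sigmoid(W @ H + b_r + b_c)               # Equation 5
        E = M * (X_hat - X)                              # masked residual
        gW = E @ H.T + lam * W                           # Equation 8
        gH = W.T @ E + lam * H                           # Equation 9
        gbr = E.sum(axis=1, keepdims=True) + lam * b_r   # Equation 10
        gbc = E.sum(axis=0, keepdims=True) + lam * b_c   # Equation 11
        W, H = W - eta * gW, H - eta * gH                # Equation 13
        b_r, b_c = b_r - eta * gbr, b_c - eta * gbc
    return W, H, b_r, b_c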

3.3 Non-negative Matrix Factorization with Automatic Model Determination (NMFk)


The NMF minimization requires prior knowledge of the latent dimensionality, 𝑘, that is, the number of latent features, which is often unknown. Choosing too small a value for 𝑘 leads to a poor approximation of the observables in X (a
problem called under-fitting). At the same time, setting 𝑘 too large makes the extracted features hard to interpret,
as they also capture noise in the data (a case of over-fitting). In essence, selecting 𝑘 is equivalent to estimating the
number of model parameters, a well-known difficult problem. In general, the existing partial solutions to this problem
are heuristic. Among these solutions is Automatic Relevance Determination (ARD) [49], which was first modified for
Principal Component Analysis [11] and then for NMF [52, 65]. Another approach is based on the assumed stability
of the NMF solution and was proposed to identify the number of stable clusters in the observational matrix X [12].
A recent model selection technique, called NMFk [5], has been successfully used to decompose the most extensive
collection of human cancer genomes [7]. NMFk integrates classical NMF-minimization with custom clustering and
Silhouette statistics [62] and combines the accuracy of the minimization and robustness/stability of the NMF solutions
when a modified bootstrap procedure (i.e., generation of a random ensemble of slightly perturbed input matrices) is
applied to estimate the number of latent features [8]. Recently, NMFk was applied to many synthetic datasets with
a predetermined number of latent features, and it demonstrated its superior performance of correctly estimating 𝑘
compared to the other known heuristics [54]. The exceptional performance of the NMFk method as a model selection
was also demonstrated both in practice [6] and in a large set of synthetic cancer genomes with a predetermined number
of latent features [34]. In addition, it was shown that NMFk performs better than spherical k-means and other methods
for topic extraction [70]. Therefore, we use NMFk as the core factorization method with automatic model selection and adapt it to WNMFk, RNMFk, and BNMFk for heuristically estimating 𝑘. For completeness, we provide the pseudocode for it in Algorithm 1 and a description of it as follows:

Algorithm 1 NMFk(X, 𝑘min, 𝑘max, 𝑀, Sill_thr = 0.8)

Require: X ∈ R+^{𝑛×𝑚}, 𝑘min, 𝑘max, 𝑀
1:  for 𝑘 in 𝑘min to 𝑘max do                              ⊲ Start and end of the rank search for NMFk
2:      for 𝑞 in 1 to 𝑀 do                                ⊲ Number of perturbations for each 𝑘
3:          X::𝑞 = Perturb(X)                              ⊲ Resampling X to create a random ensemble
4:          W::𝑘𝑞, H::𝑘𝑞 = NMF(X::𝑞, 𝑘)
5:      end for
6:      W_all = [W::𝑘1, . . . , W::𝑘𝑀] and H_all = [H::𝑘1, . . . , H::𝑘𝑀]
7:      Ŵ, Ĥ = customCluster(W_all, H_all)
8:      W̃::𝑘 = medians(Ŵ)
9:      H^reg::𝑘 = NNLS(X, W̃::𝑘)                          ⊲ Column-wise regression of H with W̃ and the columns of X
10:     s𝑘 = clusterStability(Ŵ)
11:     err𝑘 = reconstructErr(X, W̃::𝑘, H^reg::𝑘)          ⊲ Column-wise reconstruction error for L-statistics
12: end for
13: err_all = [err𝑘min, . . . , err𝑘max]
14: 𝑘opt = PvalueAnalysis(err_all, 𝑘min, 𝑘max, s𝑘, Sill_thr)   ⊲ Predicted 𝑘 value using Wilcoxon
15: return W̃::𝑘opt, H^reg::𝑘opt, 𝑘opt
Ensure: 𝑘 = 𝑘opt, W̃::𝑘opt ∈ R+^{𝑛×𝑘}, H^reg::𝑘opt ∈ R+^{𝑘×𝑚}, X = W̃::𝑘opt H^reg::𝑘opt

(1) Resampling: Based on the observable matrix X, NMFk creates an ensemble of 𝑀 random matrices, [X::𝑞]𝑞=1,...,𝑀, with means equal to the original matrix X. Each of these random matrices X::𝑞 is generated by perturbing the elements of X by small uniform noise, such that X𝑖𝑗𝑞 = X𝑖𝑗 + 𝛿𝑖𝑗𝑞 for each 𝑞 = 1, . . . , 𝑀, where 𝛿𝑖𝑗𝑞 is a small error drawn from a random distribution, different for each 𝑞.
(2) NMF minimization: We use the Frobenius norm-based multiplicative updates (MU) algorithm [44] for different numbers of latent features, 𝑘, in an interval [𝑘min, 𝑘max], for each of the 𝑀 generated random matrices.
(3) Custom clustering: For each 𝑘 ∈ [𝑘min, 𝑘max], the NMF minimizations of the 𝑀 random matrices, [X::𝑞]𝑞=1,...,𝑀, result in 𝑀 pairs [W::𝑘𝑞, H::𝑘𝑞]𝑞=1,...,𝑀. Further, NMFk clusters the set of the 𝑀 ∗ 𝑘 latent features, the columns of W::𝑘𝑞. The NMFk custom clustering is similar to k-means, but it holds exactly one column for each cluster
of W ::𝑘𝑞 . The NMFk custom clustering is similar to k-means, but it holds exactly one column for each cluster
from each of the 𝑀 NMF solutions. This constraint is needed since each NMF minimization gives exactly one
solution W ::𝑘𝑞 with the same number of columns, 𝑘. In the clustering, the cosine similarity metric measures the
similarity between the columns. Here, cosine similarity determines the degree of similarity between two vectors
in an inner product space.
(4) Robust W and H for each 𝑘: The medians of the clusters, W̃::𝑘, are the robust solution for each explored 𝑘. The corresponding mixing coefficients H^reg::𝑘 are calculated by regression of X on W̃::𝑘.
(5) Cluster stability via Silhouette statistics: NMFk explores the stability of the obtained clusters, for each 𝑘, by
calculating their Silhouettes [62]. Silhouette scores quantify the cohesion and separability of the clusters. The
Silhouette values range between [−1, 1], where −1 means an unstable cluster and +1 means perfect stability.
(6) Reconstruction error: Another metric NMFk uses is the relative reconstruction error, 𝑅 = ||X − X^rec::𝑘|| / ||X||, where X^rec::𝑘 = W̃::𝑘 ∗ H^reg::𝑘, which measures the accuracy with which a given solution with 𝑘 latent features reproduces the initial data.
(7) L-statistics: NMFk uses L-statistics [69] to automatically estimate the number of latent features. To calculate the L-statistics for each 𝑘, NMFk records the distributions of the column reconstruction errors, e𝑗 = ||X:𝑗 − X^rec:𝑗𝑘|| / ||X:𝑗||, for 𝑗 = 1, . . . , 𝑚. L-statistics compares the distributions of the column errors for different 𝑘 by a two-sided Wilcoxon rank-sum test, which evaluates whether two samples are taken from the same population [29].
(8) NMFk final solution: The optimal number of latent features, 𝑘opt, is the largest stable cluster count with low reconstruction error. The Wilcoxon rank-sum test computes the p-value for 𝑘opt, ensuring that subsequent column error distributions remain statistically similar, indicating noise fitting. L-statistics and a minimum Silhouette threshold of 0.80 guide the selection, preventing overfitting. The corresponding W̃::𝑘opt and H^reg::𝑘opt are the robust solutions for the low-rank factor matrices.

Figure 1 shows a sample Silhouette score and relative error plot from NMFk for two factorizations, demonstrating
𝑘 selection. NMFk estimates latent features based on two criteria: a high minimum Silhouette score and low relative
reconstruction error for a stable NMF solution. Lower Silhouette scores indicate overlapping clusters, while recon-
struction error decreases monotonically, with a sharp drop before stabilizing near the estimated 𝑘. Beyond this point,
overfitting causes a sudden Silhouette decline. Therefore, 𝑘 with minimum Silhouette greater than the given threshold
of 0.80 is heuristically selected as the optimal rank 𝑘.
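A compact sketch of the rank scan in Algorithm 1 is given below, building on the nmf_mu sketch from Section 3.1. It substitutes ordinary k-means for NMFk's custom one-column-per-solution clustering and applies only the minimum-Silhouette criterion, omitting the Wilcoxon-based L-statistics step, so it is an approximation of the published procedure rather than a faithful reimplementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def nmfk_scan(X, k_min, k_max, M=10, sill_thr=0.8, eps=0.015, seed=0):
    # Sketch of the NMFk rank scan (Algorithm 1); assumes k_min >= 2.
    rng = np.random.default_rng(seed)
    min_sils = {}
    for k in range(k_min, k_max + 1):
        cols = []
        for _ in range(M):
            X_q = X * (1 - eps + 2 * eps * rng.random(X.shape))  # perturb (step 1)
            W, _, _ = nmf_mu(X_q, k, seed=int(rng.integers(1 << 31)))
            cols.append(W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12))
        W_all = np.hstack(cols).T                        # (k*M) latent columns
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(W_all)
        sil = silhouette_samples(W_all, labels, metric="cosine")
        min_sils[k] = min(sil[labels == c].mean() for c in range(k))
    stable = [k for k, s in min_sils.items() if s >= sill_thr]
    return max(stable) if stable else None               # largest stable rank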

3.4 Weighted Non-Negative Matrix Factorization with Automatic Model Determination (WNMFk)
While NMF, as summarized in Section 3.1, is an effective method for extracting meaningful latent features, it minimizes
the distance between X and the approximation WH as defined in Equation 1. This minimization includes zero entries in
X, causing the approximation WH to be forced toward zero for those entries. However, in link prediction tasks, the
network nodes interact with only a small subset of other nodes, resulting in a highly sparse matrix X. Standard NMF

Fig. 1. Sample Silhouette and relative error graphs obtained from NMFk for two factorizations, X(n,615) and X(d,615), applied to 615 malware specimens [22].

minimization fails here because the zeros in X, representing missing values, are exactly the links we aim to predict
rather than the values we wish to minimize. Instead, the optimization must consider only the known entries in X.
Estimating the missing values in this way, as recommendations or predictions of missing links, constitutes a matrix
completion problem [72].
WNMFk extends the NMF framework by incorporating element-wise weights during factorization. This modification
enables WNMFk to account for varying confidence levels in the observed data, providing robustness in scenarios where
some matrix elements are less reliable than others. Additionally, WNMFk extends standard WNMF [39] by integrating
the NMFk framework for automatic rank determination, addressing the challenge of selecting the optimal number of
latent features, 𝑘.
Given a matrix X ∈ R+^{𝑛×𝑚} and an associated weight matrix M ∈ {0, 1}^{𝑛×𝑚}, where M𝑖𝑗 > 0 indicates that the corresponding entry X𝑖𝑗 is observed (a known link) and M𝑖𝑗 = 0 indicates that the entry is unobserved (a missing link), WNMFk seeks to approximate X as the product of two non-negative matrices such that X ≈ WH, where W ∈ R+^{𝑛×𝑘} and H ∈ R+^{𝑘×𝑚}. The final prediction matrix is obtained as:

X̂ = WH.    (14)

The weight matrix M prioritizes reconstructing observed entries while ignoring the missing ones. When the mask M
is binary, with values 0 and 1, instead of continuous weights, WNMFk naturally performs missing link prediction by
focusing solely on reconstructing the observed entries (M𝑖 𝑗 = 1) while ignoring the unobserved entries (M𝑖 𝑗 = 0). The
objective function for WNMF is defined as:

minimize_{W,H ≥ 0}  ||M ⊙ (X − WH)||²_F + 𝜆 ( ||W||²_F + ||H||²_F ),    (15)

where 𝜆 is a regularization parameter to prevent overfitting. The optimization problem is solved iteratively using
multiplicative update rules for W and H. At each iteration, the residual matrix R is computed as R = X − WH. For each

latent factor 𝑘, the updates for W and H are as follows, with W and H constrained to remain non-negative:

W:,𝑘 ← [ (M ⊙ R) Hᵀ𝑘,: ] / [ M (H𝑘,: ⊙ H𝑘,:)ᵀ + 𝜆 ],    (16)

Hᵀ𝑘,: ← [ (M ⊙ R)ᵀ W:,𝑘 ] / [ Mᵀ (W:,𝑘 ⊙ W:,𝑘) + 𝜆 ].    (17)
To automatically determine the optimal rank 𝑘, WNMFk leverages the NMFk framework, as described in Section 3.3.
More specifically, line 4 in Algorithm 1, where W::𝑘𝑞, H::𝑘𝑞 = NMF(X::𝑞, 𝑘), is replaced by the factorization procedure described in this section. WNMFk identifies the rank that balances reconstruction accuracy and stability across
perturbations by applying the perturbation-based ensemble approach and clustering stability metrics. Furthermore,
with WNMFk, the mask matrix M can contain non-Boolean entries, allowing continuous values to be used as weights
during approximation. For instance, if specific entries in the observed matrix X are less reliable or should contribute
less to the optimization process, their corresponding weights in M, such as M𝑖 𝑗 , can be assigned lower values. For
example, in a recommendation system, if a user’s rating for a specific item is suspected to be influenced by noise (e.g., a
randomly assigned rating rather than a genuine preference), the corresponding entry in M could be assigned a lower
weight to reduce its influence on the factorization. This flexibility makes WNMFk particularly useful in scenarios with
varying confidence levels across data points.
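A minimal sketch of the masked factorization in Equation 15 follows. For brevity it uses the classic weighted multiplicative rules rather than the per-component updates of Equations 16-17, and the placement of the regularizer in the denominators is one common convention; both are our own illustrative choices.

import numpy as np

def wnmf(X, M, k, lam=0.01, n_iter=500, eps=1e-9, seed=0):
    # Masked (weighted) NMF for Equation 15; an illustrative sketch.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    MX = M * X                                   # observed entries only
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + lam * W + eps)
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + lam * H + eps)
    return W, H, W @ H                           # X-hat = WH (Equation 14)

Because entries with M𝑖𝑗 = 0 never enter the objective, the returned product WH fills them in, which is exactly the matrix-completion behavior used here for missing link prediction.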

3.5 Recommender Non-Negative Matrix Factorization with Automatic Model Determination and Biases
(RNMFk)
RNMFk incorporates biases to account for systematic user and item tendencies. Like WNMFk, RNMFk integrates
the NMFk framework to automatically determine the optimal rank 𝑘, addressing the challenge of model selection.
While WNMFk focuses on reconstructing observed entries in the data matrix using a weight mask, RNMFk introduces
additional bias terms to improve performance in scenarios where user- and item-specific effects must be explicitly
modeled. The mask M contains Boolean values (0s or 1s) to signify the missing and known entries. The biases help
capture variations such as individual user preferences or global item popularity, which cannot be represented solely by
the latent factors W and H. In this work, we extend and modify the Collaborative NMF algorithm presented in the
Surprise package [33], adopt the code to leverage vector multiplication in our Python package T-ELF, and integrate with
the NMFk framework.
Given a data matrix X ∈ R+^{𝑛×𝑚}, where 𝑛 represents users (the number of rows) and 𝑚 represents items (the number of columns), RNMFk aims to approximate X by factoring it into:

X̂𝑖,𝑗 = W𝑖,: H:,𝑗 + 𝑏𝑊𝑖 + 𝑏𝐻𝑗 + 𝜇,    (18)

where W ∈ R+^{𝑛×𝑘} is the user latent factor matrix, H ∈ R+^{𝑘×𝑚} is the item latent factor matrix, 𝑏𝑊𝑖 and 𝑏𝐻𝑗 are the biases for user 𝑖 and item 𝑗, respectively, and 𝜇 is the global average rating, representing the group bias. This formulation extends
the standard NMF objective by explicitly modeling biases, which account for systematic user- or item-level variations.
The optimization objective in RNMFk is given by:

minimize_{W,H ≥ 0, 𝑏𝑊, 𝑏𝐻}  ||X𝑖,𝑗 − (W𝑖,: H:,𝑗 + 𝑏𝑊𝑖 + 𝑏𝐻𝑗 + 𝜇)||²_F + 𝛼 ||W||²_F + 𝛽 ||H||²_F + 𝛾 ||𝑏𝑊||²₂ + 𝛿 ||𝑏𝐻||²₂,    (19)


where 𝛼, 𝛽, 𝛾, and 𝛿 are regularization parameters for W, H, 𝑏𝑊 , and 𝑏𝐻 , respectively. The minimization is performed
only over the observed entries in X, making RNMFk suitable for matrix completion tasks. The parameters W, H, 𝑏𝑊 ,
and 𝑏𝐻 are estimated iteratively using the following updates:

𝑏𝑊 ← 𝑏𝑊 + 𝜂𝑊 Σ_{𝑗=1}^{𝑚} ( err:,𝑗 − 𝛾 𝑏𝑊 ),    (20)        𝑏𝐻 ← 𝑏𝐻 + 𝜂𝐻 Σ_{𝑖=1}^{𝑛} ( err𝑖,: − 𝛿 𝑏𝐻 ),    (21)

W ← W ⊙ (XHᵀ) / (X̂Hᵀ + 𝛼W),    (22)        H ← H ⊙ (WᵀX) / (WᵀX̂ + 𝛽H),    (23)

where X̂𝑖,𝑗 = W𝑖,: H:,𝑗 + 𝑏𝑊𝑖 + 𝑏𝐻𝑗 + 𝜇 is the predicted matrix and err𝑖,𝑗 = X𝑖,𝑗 − X̂𝑖,𝑗. To integrate automatic model selection into RNMFk, we replace the factorization procedure of NMFk, line 4 in Algorithm 1, where W::𝑘𝑞, H::𝑘𝑞 = NMF(X::𝑞, 𝑘), with the factorization procedure described in this section.
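The following sketch performs one iteration of the biased updates, following a literal reading of Equations 20-23. The clipping used to preserve non-negativity, the use of the start-of-iteration X̂ in both multiplicative updates, and the default hyper-parameters are our own illustrative assumptions.

import numpy as np

def rnmfk_step(X, W, H, bW, bH, mu, alpha=0.01, beta=0.01,
               gamma=0.01, delta=0.01, etaW=0.001, etaH=0.001, eps=1e-9):
    # One iteration of Equations 20-23 (sketch); bW has shape (n,), bH shape (m,).
    X_hat = W @ H + bW[:, None] + bH[None, :] + mu                 # Equation 18
    err = X - X_hat
    bW = bW + etaW * (err - gamma * bW[:, None]).sum(axis=1)       # Equation 20
    bH = bH + etaH * (err - delta * bH[None, :]).sum(axis=0)       # Equation 21
    # Clip to keep W and H non-negative, since the biased denominator
    # is not guaranteed positive.
    W = np.maximum(W * (X @ H.T) / (X_hat @ H.T + alpha * W + eps), 0.0)  # Eq. 22
    H = np.maximum(H * (W.T @ X) / (W.T @ X_hat + beta * H + eps), 0.0)   # Eq. 23
    return W, H, bW, bH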

3.6 Boolean Non-Negative Matrix Factorization with Automatic Model Determination (BNMFk)
BNMFk extends the NMF framework to Boolean settings, where the data matrix X ∈ {0, 1}𝑛×𝑚 consists of binary values
and integrates NMFk for automatic rank determination [19, 68]. Unlike WNMFk and RNMFk, which use weighted and
biased approaches, BNMFk applies constraints to maintain the Boolean structure in both the data and the latent factor
matrices during factorization.
Given a binary matrix X ∈ {0, 1}^{𝑛×𝑚}, BNMFk approximates X as X ≈ W ⊗𝐵 H, where W ∈ {0, 1}^{𝑛×𝑘} is the Boolean row latent factor matrix and H ∈ {0, 1}^{𝑘×𝑚} is the Boolean column latent factor matrix. The final prediction matrix is also obtained using Equation 14. BNMFk aims to minimize the reconstruction error while maintaining the binary constraints on W and H. Unlike traditional NMF, the factorization involves thresholding operations to ensure
Boolean values. We introduce the Boolean-specific clustering and thresholding operations in Sections 3.8, 3.9, and 3.10.
In our experiments, we also present results where we turn off these Boolean settings for BNMFk and turn them on for
WNMFk and RNMFk. The optimization problem for BNMFk is defined as:

minimize_{W,H ≥ 0}  ||X − (W ⊗𝐵 H)||²_F + 𝛼 ||W||²_F + 𝛽 ||H||²_F,    (24)

where 𝛼 and 𝛽 are regularization parameters that penalize large values in W and H. The Boolean constraints are enforced
through adaptive thresholding during updates. BNMFk employs multiplicative updates with adaptive thresholding to
ensure the Boolean structure of the latent factor matrices:
W ← threshold( W ⊙ (XHᵀ) / (WHHᵀ + 𝛼), 𝜏low, 𝜏high ),    (25)

H ← threshold( H ⊙ (WᵀX) / (WᵀWH + 𝛽), 𝜏low, 𝜏high ),    (26)
where the threshold function, described in Section 3.10, restricts W and H to Boolean values. Here, 𝜏low and 𝜏high represent the lower and upper thresholds used in the thresholding operation for binarizing the matrices W and H.
BNMFk integrates the NMFk framework to determine the optimal rank 𝑘 automatically. Like WNMFk and RNMFk, we replace the NMF procedure in NMFk, line 4 in Algorithm 1, to integrate the automatic model determination system.

3.7 LMF Extensions for Ensemble BNMFk, RNMFk, and WNMFk


We also introduce an ensemble approach for BNMFk, RNMFk, and WNMFk by integrating LMF from Section 3.2
into these methods (WNMFklmf , BNMFklmf , and RNMFklmf ). This extension leverages the predicted rank from the
automatic model determination process in plain BNMFk, RNMFk, and WNMFk, solves for LMF with the same rank, and
combines the reconstructed matrix X̂ from the original decomposition with the biases learned from LMF. The combined
output is then passed through a sigmoid function to produce probabilistic predictions.
The ensemble approach works as follows:
(1) X is first factorized with one of the models introduced above BNMFk, RNMFk, or WNMFk to obtain the predicted
rank 𝑘 and X̂ (Equation 14 for BNMFk and WNMFk, and Equation 18 for RNMFk) from this decomposition.
(2) The predicted rank 𝑘 from the automatic model determination step is used to initialize LMF, where the same X is
decomposed at the predicted rank 𝑘 using LMF.
(3) LMF solves the optimization problem to learn the row and column biases (b𝑟 and b𝑐 from Equation 5).
(4) Then the final prediction matrix X̃final is calculated as:

X̃final = 𝜎( X̂ + b𝑟 + b𝑐 ).    (27)

This ensemble approach aims to combine the strengths of the original matrix factorization method BNMFk, RNMFk,
or WNMFk with LMF, effectively combining the structural insights and adding the probabilistic modeling capabilities
of LMF.
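The combination step in Equation 27 reduces to a few lines; the sketch below assumes b_r and b_c carry the broadcastable shapes (𝑛 × 1 and 1 × 𝑚) defined for Equation 5.

import numpy as np

def ensemble_predict(X_hat, b_r, b_c):
    # Equation 27: combine the reconstruction X-hat from WNMFk, BNMFk,
    # or RNMFk with the row/column biases learned by LMF at the same rank.
    return 1.0 / (1.0 + np.exp(-(X_hat + b_r + b_c)))  # element-wise sigmoid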

3.8 Boolean Perturbations for Boolean Factorization


In the context of Boolean matrix factorization, Boolean perturbations are applied to generate an ensemble of slightly
modified versions of the binary input matrix X. These perturbations introduce controlled noise by flipping randomly
selected entries in X, ensuring diversity in the perturbed matrices while maintaining the Boolean structure. The
perturbation process can be described as follows:
• Positive Noise (Additive): A fraction of entries with value 0 in X are randomly flipped to 1.
• Negative Noise (Subtractive): A fraction of entries with value 1 in X are randomly flipped to 0.
The perturbed matrix Y is generated by applying these noise components Y = boolean(X, 𝜖) where 𝜖 = [𝜖pos, 𝜖neg ]
represents the proportion of positive and negative noise to be added. For each noise type:
• 𝜖pos : Proportion of 0s in X flipped to 1s.
• 𝜖neg : Proportion of 1s in X flipped to 0s.
We replace line 3 of Algorithm 1 of NMFk, where X::𝑞 = Perturb(X), with the Boolean perturbation when applying Boolean factorization.
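A sketch of this Boolean perturbation is shown below; the function name boolean_perturb is illustrative, and the defaults mirror the (𝜖pos, 𝜖neg) = (0.015, 0.015) setting used later in Section 3.11.

import numpy as np

def boolean_perturb(X, eps_pos=0.015, eps_neg=0.015, seed=0):
    # Flip a fraction eps_pos of the 0 entries to 1 (positive noise) and
    # a fraction eps_neg of the 1 entries to 0 (negative noise).
    rng = np.random.default_rng(seed)
    Y = X.copy()
    zeros = np.argwhere(X == 0)
    ones = np.argwhere(X == 1)
    flip0 = zeros[rng.choice(len(zeros), int(eps_pos * len(zeros)), replace=False)]
    flip1 = ones[rng.choice(len(ones), int(eps_neg * len(ones)), replace=False)]
    Y[flip0[:, 0], flip0[:, 1]] = 1
    Y[flip1[:, 0], flip1[:, 1]] = 0
    return Y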

3.9 Boolean Clustering for Boolean Factorization


In the context of Boolean factorization, the clustering procedure used in the standard NMFk framework, specifically line 7 of Algorithm 1 where Ŵ, Ĥ = customCluster(W_all, H_all), is replaced with a Boolean clustering approach described
in this section. This adjustment ensures that the clustering step respects the Boolean nature of the data, making it
more suitable for binary datasets. Boolean clustering operates on the latent factor matrices W generated from multiple
perturbed instances of the input matrix X. Let W_all ∈ {0, 1}^{𝑛×𝑘×𝑀} represent a three-dimensional tensor containing the
perturbed W matrices, where: 𝑛 is the number of rows in W, 𝑘 is the number of latent features, 𝑀 is the number of

perturbations, and W and H are Boolean following the thresholding techniques that will be introduced in Section 3.10.
Boolean clustering aims to iteratively align and compute the Boolean centroids for the perturbed W matrices using a
distance metric tailored for Boolean data, such as Hamming distance [45]. The Boolean clustering algorithm alternates
between two main steps: distance calculation and centroid computation, repeated until convergence or a maximum
number of iterations is reached.

1. Distance Calculation: For each perturbed W, the algorithm computes the distance between the current centroids
and the columns of W using a Boolean distance metric, such as Hamming distance. The columns of each perturbed W
are then reordered to minimize the total distance to the centroids.

2. Centroid Computation: The centroids are updated based on the reordered W matrices. Boolean centroids are
computed by aggregating the binary values of each column across all perturbations, ensuring that the centroids reflect
the majority consensus of the binary data.
Let W_all represent the tensor of perturbed W matrices and C represent the centroids. Then the steps are:
(1) Initialization: The centroids are initialized using the first perturbed W matrix in W_all.
(2) Distance Calculation: For each perturbed W, compute the distance between the centroids and the columns of
W and reorder the columns to minimize the distance.
(3) Centroid Update: Compute the Boolean centroids by aggregating the reordered columns of W across all
perturbations.
(4) Convergence Check: If the column ordering stabilizes across all iterations, terminate the algorithm; otherwise,
repeat the process.
Boolean clustering replaces the standard custom clustering step in NMFk, enabling BNMFk (or WNMFk and RNMFk
when Boolean settings are used) to effectively analyze Boolean datasets while maintaining consistency with the
underlying data structure.
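The sketch below illustrates a single pass of this alignment-and-consensus idea using Hamming distances; the full procedure iterates these two steps until the column orderings stabilize, and the use of SciPy's linear_sum_assignment to pick one column per cluster is our own illustrative shortcut.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def boolean_cluster(W_all):
    # Single-pass sketch: align the columns of each perturbed W (an
    # n x k x M Boolean tensor) to the centroids by minimum total
    # Hamming distance, then take the majority vote as the centroid.
    n, k, M = W_all.shape
    C = W_all[:, :, 0].astype(float)                # init from 1st perturbation
    aligned = np.empty_like(W_all)
    for q in range(M):
        D = cdist(C.T, W_all[:, :, q].T, metric="hamming")  # k x k distances
        _, cols = linear_sum_assignment(D)          # exactly one column per cluster
        aligned[:, :, q] = W_all[:, cols, q]
    return (aligned.mean(axis=2) >= 0.5).astype(int)  # majority consensus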

3.10 Boolean Latent Factor Thresholding


Boolean latent factor thresholding ensures that the latent factor matrices W and H retain their Boolean structure while
minimizing the reconstruction error. This section presents three thresholding techniques–Otsu’s method, k-means
clustering, and coordinate descent–each of which determines thresholds to binarize W and H while preserving a close
approximation to the observed matrix X. Binarization converts continuous or multi-valued data into binary values (0s
and 1s) based on a predefined threshold.

3.10.1 Otsu’s Method. Otsu’s method [58] determines the optimal threshold for binarizing the latent factor matrices W
and H by maximizing the between-class variance in their values. Specifically, for each column W:,𝑖 or row H𝑖,: , Otsu’s
method computes a threshold 𝑡 ∗ that maximizes the separability between binary classes, ensuring that W, H ∈ {0, 1}.
The threshold 𝑡∗ is defined as:

𝑡∗ = arg max_𝑡 [ 𝜋₀(𝑡) 𝜋₁(𝑡) (𝜇₀(𝑡) − 𝜇₁(𝑡))² ],    (28)
where 𝜋0 (𝑡) and 𝜋1 (𝑡) are the probabilities (normalized counts) of values in W:,𝑖 or H𝑖,: below and above the threshold
𝑡, respectively, 𝜇 0 (𝑡) and 𝜇 1 (𝑡) are the means of the values below and above the threshold 𝑡, respectively. Otsu’s method
finds the optimal threshold by exhaustively evaluating all possible threshold values and selecting the one that maximizes
the between-class variance 𝜎²𝐵(𝑡), defined below. For a given threshold 𝑡, the method partitions the data into two classes:

values below 𝑡 and values above 𝑡. It then calculates the probabilities 𝜋0 (𝑡) and 𝜋1 (𝑡) and the means 𝜇 0 (𝑡) and 𝜇 1 (𝑡) of
the two classes. The threshold that maximizes the separability of these two classes, as measured by 𝜎²𝐵(𝑡), is chosen as
the optimal threshold. This ensures that the binary partitioning captures the most significant difference between the
two groups. For each component 𝑖 of W or H:

• Compute the histogram of the values in W:,𝑖 or H𝑖,:.
• Calculate 𝜋₀(𝑡), 𝜋₁(𝑡), 𝜇₀(𝑡), and 𝜇₁(𝑡) for each potential threshold 𝑡.
• Select the 𝑡∗ that maximizes the between-class variance:

𝜎²𝐵(𝑡) = 𝜋₀(𝑡) 𝜋₁(𝑡) (𝜇₀(𝑡) − 𝜇₁(𝑡))².

After determining 𝑡∗, binarization is applied as follows:

W𝑖𝑗 = 1 if W𝑖𝑗 ≥ 𝑡∗ and W𝑖𝑗 = 0 if W𝑖𝑗 < 𝑡∗;    H𝑖𝑗 = 1 if H𝑖𝑗 ≥ 𝑡∗ and H𝑖𝑗 = 0 if H𝑖𝑗 < 𝑡∗.    (29)
This approach ensures that the thresholds for W and H effectively partition the data into binary groups, maximizing
the separability between classes and preserving the Boolean structure of the latent factors.
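A direct sketch of the exhaustive search in Equation 28 is given below; the histogram bin count is an illustrative choice.

import numpy as np

def otsu_threshold(z, bins=256):
    # Exhaustive Otsu search (Equation 28): evaluate every candidate
    # threshold and keep the one maximizing the between-class variance.
    hist, edges = np.histogram(z, bins=bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        pi0, pi1 = p[:i].sum(), p[i:].sum()
        if pi0 == 0 or pi1 == 0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / pi0   # class mean below t
        mu1 = (p[i:] * centers[i:]).sum() / pi1   # class mean above t
        var = pi0 * pi1 * (mu0 - mu1) ** 2        # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

# Binarize one latent factor column as in Equation 29:
# W[:, i] = (W[:, i] >= otsu_threshold(W[:, i])).astype(int)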

3.10.2 K-Means Clustering. K-means clustering [59] thresholds each component of W and H by clustering their values
into two groups corresponding to 0 and 1. For a vector z (e.g., W:,𝑖 or H𝑖,: ), k-means clustering identifies two cluster
centers 𝑐₀ and 𝑐₁. The threshold is then defined as:

𝑡∗ = (𝑐₀ + 𝑐₁) / 2.    (30)
The binarization is performed by assigning:

𝑧𝑗 = 1 if 𝑧𝑗 ≥ 𝑡∗ and 𝑧𝑗 = 0 if 𝑧𝑗 < 𝑡∗.    (31)

3.10.3 Coordinate Descent Thresholding (search). Coordinate descent thresholding iteratively adjusts the thresholds for W and H to minimize the reconstruction error of X ≈ W ⊗𝐵 H. For each component 𝑖, the reconstruction error is computed as:

Error = ||X − (W:,𝑖 H𝑖,:)||²_F .    (32)

The thresholds 𝑡𝑊 [𝑖] and 𝑡𝐻 [𝑖] for W:,𝑖 and H𝑖,: are optimized iteratively:

• Fix H and optimize W:,𝑖 by selecting 𝑡𝑊 [𝑖] to minimize the reconstruction error.
• Fix W and optimize H𝑖,: by selecting 𝑡𝐻 [𝑖] to minimize the reconstruction error.

The process repeats until the thresholds converge or the maximum number of iterations is reached. The final binarization
is applied as:

 1,

 if W:,𝑖 ≥ 𝑡𝑊 [𝑖]
W:,𝑖 = (33)
 0, if W:,𝑖 < 𝑡𝑊 [𝑖],


and similarly for H𝑖,: . In our experiments in Section 6, we use search as a term for this thresholding technique.

3.11 Uncertainty Quantification (UQ)


UQ is a critical component of robust predictive modeling, providing insights into the reliability of model predictions. In
the context of NMFk and its variants WNMFk, RNMFk, and BNMFk presented in this paper, UQ evaluates the stability
of the reconstructed matrix X̂ (Equation 14 for BNMFk and WNMFk, and Equation 18 for RNMFk) across perturbations,
using data augmentation to define the confidence, enabling the identification of confident and uncertain predictions. Our
UQ framework leverages the idea that truly confident predictions are stable under perturbation or data augmentation
[10, 75], which hypothesizes that stable predictions across multiple perturbations are more likely to be accurate. During
the NMFk process, perturbations of the input matrix X generate an ensemble of latent factor matrices W all and H all ,
corresponding to different realizations of W and H.
Under non-Boolean settings, perturbations are generated by uniformly sampling from X to create modified versions
that are a controlled distance away from the original matrix X. This is achieved by scaling the entries of X with random
noise drawn uniformly from the range [1 − 𝜖, 1 + 𝜖], where 𝜖 controls the magnitude of the perturbation. Then, the
perturbed matrix Y is defined as:

Y = X ⊙ (1 − 𝜖 + 2𝜖 · rand(shape(X))) , (34)

where rand(shape(X)) generates random values uniformly distributed in [0, 1], and 𝜖 is the perturbation parameter. In our experiments, we use 𝜖 = 0.015; this hyper-parameter is selected as it was shown to give a stable region for 𝑘 selection [54]. The perturbation method described in Section 3.8 is used for Boolean settings. This approach modifies
the Boolean structure of X by flipping selected entries (from 0 to 1 or 1 to 0) based on the specified noise proportions
𝜖pos and 𝜖neg (we used (𝜖pos , 𝜖neg ) = (0.015, 0.015) in our experiments). For each perturbation 𝑝, the reconstructed
matrix X̂ (𝑝 ) is computed as:

X̂(𝑝) = W(𝑝) H(𝑝),    (35)

where W(𝑝) ∈ R+^{𝑛×𝑘} and H(𝑝) ∈ R+^{𝑘×𝑚} are the latent factors for perturbation 𝑝. For 𝑃 perturbations, we have W_all = [W::1, . . . , W::𝑝, . . . , W::𝑃] and H_all = [H::1, . . . , H::𝑝, . . . , H::𝑃], where W_all and H_all are three-dimensional tensors of size 𝑛 × 𝑘 × 𝑃 and 𝑘 × 𝑚 × 𝑃, respectively. Across the 𝑃 perturbations, the ensemble of reconstructed matrices {X̂(𝑝)}𝑝=1,...,𝑃 reflects the variability in the model's predictions, where X̂_all = [X̂::1, . . . , X̂::𝑝, . . . , X̂::𝑃] is a three-dimensional tensor of size 𝑛 × 𝑚 × 𝑃. To quantify uncertainty, we compute the standard deviation of the reconstructed values for each entry (𝑖, 𝑗) in X̂:
U𝑖𝑗 = sqrt( (1/𝑃) Σ_{𝑝=1}^{𝑃} ( X̂(𝑝)𝑖𝑗 − X̂̄𝑖𝑗 )² ),    (36)

where X̂(𝑝)𝑖𝑗 is the (𝑖, 𝑗) entry of X̂(𝑝), and X̂̄𝑖𝑗 = (1/𝑃) Σ_{𝑝=1}^{𝑃} X̂(𝑝)𝑖𝑗 is the mean of the reconstructed values at (𝑖, 𝑗) across all
𝑃 perturbations. The resulting uncertainty matrix U captures the variability of predictions at each entry of X̂. Low
uncertainty values, or low standard deviation, indicate consistent predictions across perturbations, suggesting high
confidence in the model’s output at those entries. Conversely, high uncertainty values signal variability and reduced
confidence in predictions. Based on the hypothesis that truly confident predictions are stable under error [75], predictions with low uncertainty are more likely to be accurate and reliable.
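Given the stacked reconstructions, Equation 36 and a simple abstention rule reduce to the following sketch; the quantile-based rejection threshold is a hypothetical choice for illustration, not a rule prescribed by our method.

import numpy as np

def uncertainty_matrix(X_hat_all):
    # Equation 36: element-wise standard deviation across the P
    # reconstructions stacked along the last axis of an n x m x P tensor.
    # Low values indicate stable, higher-confidence predictions.
    return X_hat_all.std(axis=2)

def abstain_mask(U, quantile=0.9):
    # Reject-option sketch: abstain ("I do not know") on entries whose
    # uncertainty exceeds the chosen quantile threshold (hypothetical rule).
    return U > np.quantile(U, quantile)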


Fig. 2. Dog Dataset - Four binary images are used as Boolean latent features to generate the synthetic data of shape 400 × 16.


Fig. 3. Swimmer Dataset - Dataset of 16 swimmer images. The first and third rows are the images with real-valued intensities
ranging from 0 to 19. The second and fourth rows display the Boolean versions obtained after applying Otsu thresholding. For our
analysis, we use the Boolean versions of the dataset, represented as a matrix of size 1024 × 256.

4 Datasets
We evaluate our methods using three synthetic datasets (two Boolean and one Gaussian distributed), assessing rank
prediction, Boolean and non-Boolean performance, data sparsity effects, UQ, and link prediction accuracy. Additionally,
we benchmark our methods against existing approaches using five real-world PPI network datasets. This section
summarizes the datasets and their characteristics.

4.1 Synthetic Datasets


The first dataset used in our experiments, referred to as the Dog dataset [19], is constructed by mixing four binary
images. This dataset includes four distinct images (sun, person, dog, and cloud), each with dimensions 20 × 20 pixels. The
Dog dataset is illustrated in Figure 2. To generate the dataset, we stack the columns of each image along the first axis,
and then create 16 unique samples by combining these stacked representations using Boolean addition, considering all
possible combinations of the four images. This results in a matrix X ∈ {0, 1}400×16 . Since the dataset comprises four
distinct images, the true Boolean rank of the Dog dataset is 𝑘 = 4.
The second dataset used in our experiments is the Swimmer dataset [48], which contains 16 synthetic images of a
"stick figure" swimming. Each image has dimensions of 32 × 32 pixels. Examples from this dataset are displayed in

Table 2. The statistics of the five PPI datasets: the number of proteins and of positive and negative interaction pairs in each PPI network when represented as a matrix.

Properties/Dataset          H.sapiens-extended   Brain     Disease of Metabolism   Liver     Neurodegenerative Disease
Number of proteins          14,407               11,167    1,036                   10,627    820
Number of positive pairs    157,950              225,200   5,131                   218,239   5,881
Number of negative pairs    157,300              223,130   5,123                   215,984   5,879

Figure 3. As shown in the first and third rows of Figure 3, the images contain "splashes" appearing as high-intensity values away from the swimmer, as well as low-intensity values along the swimmer figure representing locations under the water. We applied Otsu thresholding to convert the dataset into a Boolean format, with the resulting Boolean
images displayed in rows 2 and 4 of Figure 3. The locations of the "splashes" and the places where the swimmer is under
the water are visible in the Boolean version of the dataset. To construct the Boolean dataset, we stack each Boolean
image along the first axis and combine them using Boolean addition into 256 different samples, resulting in a matrix
X ∈ {0, 1}1024×256 . The true rank of the dataset is 𝑘 = 16, corresponding to the 16 distinct swimmer figures.
As the third synthetic dataset, we use randomly sampled matrices X from a normal distribution, leveraging the
framework introduced in [54]. We will call this dataset the Gaussian Dataset in our paper. We generate matrices
X ∈ R+^{50×𝑚}, where 𝑚 ∈ {50, 100, 150, 200, 250, 300, 350, 400} (8 distinct shapes), corresponding to an increasing number
of features. For each matrix size, we set the true rank 𝑘 ∈ {2, 3, 4, 5, 6} (5 distinct true rank values). Each matrix size and
rank combination is randomly sampled ten times with different random seeds, resulting in 8 × 5 × 10 = 400 matrices.
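A sketch of one common way to generate such matrices, as the product of two non-negative rank-𝑘 factors with small positive noise; the exact generation recipe of [54] may differ.

```python
import numpy as np

def make_matrix(n=50, m=100, k=4, seed=0, noise=0.01):
    """Sketch: a nonnegative n x m matrix of (approximate) rank k, built as
    a product of nonnegative factors; the exact recipe of [54] may differ."""
    rng = np.random.default_rng(seed)
    W = np.abs(rng.normal(size=(n, k)))
    H = np.abs(rng.normal(size=(k, m)))
    return W @ H + noise * np.abs(rng.normal(size=(n, m)))

# 8 widths x 5 true ranks x 10 seeds = 400 matrices, as in the experiments.
matrices = [make_matrix(m=m, k=k, seed=s)
            for m in range(50, 401, 50)
            for k in range(2, 7)
            for s in range(10)]
print(len(matrices))  # 400
```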
While the two Boolean datasets mentioned above and the one non-Boolean synthetic dataset are used to assess the performance
of our methods under controlled conditions, we also utilize larger real-world datasets to further evaluate and
validate our methods in practical scenarios.

4.2 PPI Datasets


This paper uses five PPI network datasets from [60]. Specifically, the PPI networks include Brain, Disease of Metabolism,
H. sapiens-extended, Liver, and Neurodegenerative Disease. The statistics of these five PPI networks are summarized in
Table 2. Our pre-processing of these datasets involves removing conflicting or inconsistent interactions. For example,
cases where protein 𝑖 is shown to interact positively with protein 𝑗, but the dataset also includes a second entry indicating
a negative interaction between the same proteins, are excluded. Additionally, we filter out proteins that interact with
fewer than five other proteins. This step follows a traditional pre-processing technique from recommender systems, as
described in [23], serving as a pruning approach to ensure the inclusion of proteins with sufficient interactions to allow
meaningful predictions.
Our experiments benchmark against the symLMF method results reported in [60]. However, it is unclear
whether [60] applied similar pre-processing considerations, notably the removal of conflicting protein interactions. Therefore,
our analysis also includes a comparison with LMF (symLMF without the symmetry constraint) on the pre-processed
datasets used in this paper.

5 Experimental Setup
We evaluate the results on our synthetic datasets (Dog Data, Swimmer Data, and Gaussian Data) by randomly sampling
the missing links 10 times for cross-validation to obtain statistically robust estimates. For the Boolean datasets, Dog Data and
Swimmer Data, missing links are randomly sampled by stratifying on 0s and 1s. This enables us to test the methods'
predictive capabilities on both negative and positive known links. Additionally, we test the methods against
data sparsity on the Dog and Gaussian datasets by systematically increasing the test-set size from 10% to 90% in
increments of 10%.
Given 𝑦 total samples, we define the training set size as a proportion trainsize ∈ {0.1, 0.2, . . . , 0.9}. The number
of samples in the training set is 𝑦train = 𝑦 × trainsize. The test set consists of the remaining samples, with size
𝑦test = 𝑦 × (1 − trainsize). For missing link prediction, we define the observation matrix X, where known links are
represented by nonzero values (X𝑖 𝑗 ≠ 0), and known negative links (i.e., known absence of a connection) are represented
by X𝑖 𝑗 = 0. The locations of known links are determined by the mask matrix M, where M𝑖 𝑗 = 1 indicates an observed
link in the training set, and M𝑖 𝑗 = 0 represents a missing link. To construct the training and test sets, we define the
index sets:

(1) Let Ipos be the set of indices where X𝑖 𝑗 ≠ 0 (known positive links).
(2) Let Ineg be the set of indices where X𝑖 𝑗 = 0 (known negative links).
(3) The test set index set, Itest , is chosen such that the number of indices satisfies |Itest | = 𝑦test .

The test set is constructed by stratified sampling from both known positive and known negative links:

(1) Positive known links: A subset of Ipos is randomly selected and placed in Itest .
(2) Negative known links: A subset of Ineg is randomly selected and placed in Itest .
(3) Training set: The remaining indices belong to the training index set Itrain = (Ipos ∪ Ineg ) \ Itest , where \
represents the set difference.

The mask matrix M is defined such that test set locations are masked out:

M𝑖𝑗 = 0 if (𝑖, 𝑗) ∈ Itest, and M𝑖𝑗 = 1 otherwise.    (37)

The training matrix is then:

Xtrain = X ⊙ M, (38)
where ⊙ represents the element-wise (Hadamard) product. The test matrix consists of the missing edges:

Xtest = X ⊙ (1 − M). (39)


A prediction matrix X̂ is computed using a model trained on Xtrain . The performance is evaluated by comparing the
predicted values at test locations:

X̂test = X̂ ⊙ (1 − M), (40)


where we extract only the entries corresponding to the test set to measure prediction performance.
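A minimal sketch of this stratified masking procedure, assuming X is a NumPy array; the helper name is ours.

```python
import numpy as np

def stratified_split(X, test_size=0.2, seed=0):
    """Sketch of the stratified masking above: sample test indices separately
    from known positive (X != 0) and known negative (X == 0) links, then
    build the mask M of Eq. (37) and the matrices of Eqs. (38)-(39)."""
    rng = np.random.default_rng(seed)
    M = np.ones(X.shape, dtype=np.uint8)
    for idx in (np.argwhere(X != 0), np.argwhere(X == 0)):
        n_test = int(round(test_size * len(idx)))
        picks = idx[rng.choice(len(idx), size=n_test, replace=False)]
        M[picks[:, 0], picks[:, 1]] = 0    # mask test locations
    X_train = X * M                        # Eq. (38)
    X_test = X * (1 - M)                   # Eq. (39)
    return M, X_train, X_test
```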
On the Swimmer dataset, we keep the test-set size fixed at 10% and do not test for data sparsity, as the computational
time required for this dataset is significantly longer due to the larger matrix. We set the test set size for the PPI dataset
to 20%, the same as in [60]. The missing links are randomly sampled 10 times to ensure robust evaluation. All results
are reported with an average over the cross-validations with coverage intervals (CIs).

5.1 Metrics
Performance metrics include rank 𝑘 predictions, Root Mean Squared Error (RMSE), abstained sample fraction, RMSE on
non-abstained samples, and Pearson Correlation between UQ entries and reconstruction error. We also report ROC
AUC and PR AUC, including UQ-based extensions, where UQ values from U at test set points Itest serve as weights,
simulating confidence-based predictions.

Rank 𝑘 Predictions. We assess the accuracy of the model’s automatic determination of the true rank 𝑘. Correct rank
prediction is critical for capturing the underlying structure of the data. Our results use violin plots [30] to show how
the rank predictions are distributed.

Root Mean Squared Error (RMSE). RMSE measures the reconstruction error, quantifying how closely the reconstructed
matrix X̂ approximates the observed matrix X at the test set points Itest :
RMSE = √( (1/|Itest|) ∑(𝑖,𝑗) ∈ Itest (X𝑖𝑗 − X̂𝑖𝑗)² ).    (41)


A lower RMSE indicates better reconstruction accuracy. We use RMSE to report the performance of predicting the
missing links.
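A direct translation of Equation (41), as a sketch assuming X, X̂, and M are NumPy arrays and M marks test entries with 0:

```python
import numpy as np

def rmse_at_test(X, X_hat, M):
    """Eq. (41): RMSE over the held-out (masked) entries only."""
    test = (M == 0)
    return np.sqrt(np.mean((X[test] - X_hat[test]) ** 2))
```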

Fraction of Abstained or Rejected Samples. This metric evaluates the proportion of predictions the model abstains from
due to high uncertainty, focusing on confident predictions while disregarding less certain ones. The fraction of abstained
samples is calculated as follows:
𝑓abstain = |Iabstain| / |Itest|,    (42)
where 𝑓abstain is the fraction of abstained predictions, |Iabstain | is the number of abstained predictions, and |Itest | is
the total number of test set samples. The coverage rate can be calculated with 1 − 𝑓abstain , referring to the fraction of
non-abstained samples. The threshold for rejecting predictions in missing links is based on the uncertainty values in U.
For a given uncertainty matrix U, the threshold for abstaining (reject-option) 𝜏 is defined as:
𝜏 = (1/|Itrain|) ∑(𝑖,𝑗) ∈ Itrain U𝑖𝑗,    (43)


where Itrain is the set of all training-set indices (known links), and U𝑖 𝑗 is the uncertainty value at location (𝑖, 𝑗).
Predictions at test-set indices (𝑖, 𝑗) ∈ Itest are abstained if:

U𝑖 𝑗 > 𝜏, (44)

where Itest represents the set of test-set indices. This ensures that the model avoids making predictions in locations
with higher uncertainty than the average certainty observed in the training set (known links).

RMSE on Non-Rejected or Non-Abstained Samples. This metric calculates the RMSE only on the predictions that are not
abstained:
RMSENon-Abstained = √( (1/|I′test|) ∑(𝑖,𝑗) ∈ I′test (X𝑖𝑗 − X̂𝑖𝑗)² ),    (45)

where I′test is the subset of Itest containing the non-abstained predictions. This highlights the accuracy of the most confident
predictions.
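A sketch of Equations (42)-(45) together, under the same array assumptions as above (the helper name and signature are ours):

```python
import numpy as np

def abstain_and_score(X, X_hat, U, M):
    """Sketch of Eqs. (42)-(45): the threshold tau is the mean uncertainty
    over training (known) links; test predictions with U > tau are abstained,
    and RMSE is recomputed over the remaining confident predictions."""
    train, test = (M == 1), (M == 0)
    tau = U[train].mean()                          # Eq. (43)
    keep = test & (U <= tau)                       # non-abstained test entries
    f_abstain = 1.0 - keep.sum() / test.sum()      # Eq. (42)
    rmse_kept = np.sqrt(np.mean((X[keep] - X_hat[keep]) ** 2))  # Eq. (45)
    return tau, f_abstain, rmse_kept
```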

ROC, AUC, and PR AUC. Receiver Operating Characteristic (ROC) AUC and Precision-Recall (PR) AUC are standard
metrics for evaluating classification performance. ROC AUC measures the ability to distinguish between positive and
negative classes, while PR AUC evaluates precision and recall trade-offs:
ROC AUC = ∫₀¹ TPR(FPR) 𝑑(FPR),    PR AUC = ∫₀¹ Precision(Recall) 𝑑(Recall).    (46)
Our results include their UQ-based extensions incorporating UQ values U as weights, simulating a scenario where
confidence is assigned to predictions. The weights are calculated based on the UQ values and normalized to account for
overall uncertainty. Specifically, the weight for each prediction is defined as:
𝑤𝑖𝑗 = U𝑖𝑗 / (1 + median{U𝑘𝑙 : (𝑘, 𝑙) ∈ Itrain}).    (47)

The normalized weight is given by:

𝑤norm𝑖𝑗 = 1 / (1 + 𝑤𝑖𝑗).    (48)
Using these normalized weights, the UQ-based ROC AUC and PR AUC metrics are computed as:
UQ-ROC AUC = ( ∑(𝑖,𝑗) ∈ Itest 𝑤norm𝑖𝑗 · TPR(FPR) ) / ( ∑(𝑖,𝑗) ∈ Itest 𝑤norm𝑖𝑗 ),    (49)

UQ-PR AUC = ( ∑(𝑖,𝑗) ∈ Itest 𝑤norm𝑖𝑗 · Precision(Recall) ) / ( ∑(𝑖,𝑗) ∈ Itest 𝑤norm𝑖𝑗 ).    (50)

We incorporate these weights into the ROC AUC and PR AUC calculations by utilizing the sample_weight parameter
provided in Scikit-learn [59]. This allows us to account for the uncertainty-based weights during the evaluation, ensuring
that higher-confidence predictions contribute more significantly to the metrics. In contrast, less confident predictions
have a reduced impact, reflecting the effect of uncertainty on classification performance.
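The following sketch shows how such weights can be passed to Scikit-learn; the function and argument names are ours, and average precision is used here as the PR AUC estimate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def uq_weighted_auc(y_true, y_score, u_test, u_train):
    """Sketch of Eqs. (47)-(50): uncertainty-derived weights are passed to
    Scikit-learn through sample_weight, so that higher-confidence test
    predictions contribute more to the scores."""
    w = u_test / (1.0 + np.median(u_train))  # Eq. (47)
    w_norm = 1.0 / (1.0 + w)                 # Eq. (48)
    roc = roc_auc_score(y_true, y_score, sample_weight=w_norm)           # Eq. (49)
    pr = average_precision_score(y_true, y_score, sample_weight=w_norm)  # Eq. (50)
    return roc, pr
```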

5.2 System Configuration


We ran the experiments on an HPC cluster named Dracarys, located at the Los Alamos National Laboratory (LANL).
Dracarys uses the AMD EPYC 9454 48-core processor at a clock speed of 3.81GHz. There are 192 virtual processors and
a total physical RAM of 1.97 TeraBytes (TBs). The system also comprises 8 NVIDIA H100 GPUs with VRAM memory of
82 GigaBytes (GBs) each.

6 Experiments
This section presents our methods’ performance, starting with synthetic datasets under Boolean and non-Boolean
settings. We evaluate rank prediction, link prediction, performance under increasing sparsity, and UQ utility in controlled
settings with pre-determined ranks. Next, we showcase results on PPI networks, demonstrating real-world effectiveness.
Additional synthetic dataset results are reported in Appendix A.

Fig. 4. Results for Dog Data across methods including BNMFkkmeans , NMFk, and WNMFk. Boolean thresholding is not used for
NMFk and WNMFk. The first row presents violin plots visualizing the rank 𝑘 predictions at different test-set size levels. The second
row displays RMSE scores for the test set, demonstrating the missing link prediction performance. The results are reported for each
rank 𝑘 on the x-axis, with the dark/dashed vertical line across columns being the true rank 𝑘 = 4.

6.1 Dog Data


Our results on the Dog Data are presented in Figures 4, 5, and 6, where each column displays the performance of one
of our methods under a particular setting. For completeness, a more comprehensive version of these results is provided
in Appendix A.1, Figure 13.
Figure 4 compares BNMFkkmeans , NMFk, and WNMFk, where the k-means subscript denotes Boolean thresholding
via k-means clustering (Section 3.10). In the first row, BNMFkkmeans predicts the true rank 𝑘 = 4 more consistently
than NMFk and WNMFk, as expected for a Boolean decomposition method on Boolean Dog Data. However, for test-set
sizes above 0.4, BNMFkkmeans shifts toward 𝑘 = 1, while NMFk and WNMFk frequently predict ranks above 𝑘 = 4,
capturing non-Boolean structures. Both BNMFkkmeans and WNMFk predict lower ranks as sparsity increases. The
second row shows RMSE scores, where lower values indicate better link prediction. BNMFkkmeans achieves the lowest
RMSE at smaller test sets, with RMSE increasing as test-set size grows, limiting pattern learning. RMSE is minimized
near 𝑘 = 4, emphasizing the importance of correct rank selection.
Figure 5 expands on these results, comparing Boolean thresholding techniques for BNMFk, NMFk, and WNMFk
(Section 3.10). Each column represents a method using a different thresholding approach, denoted by subscripts kmeans,
otsu, and search, with search referring to coordinate descent-based thresholding. For instance, BNMFkotsu applies
Otsu thresholding. The first row presents rank prediction violin plots, while the second row reports RMSE scores.
Compared to Figure 4, Boolean thresholding helps NMFk and WNMFk predict 𝑘 = 4 more frequently, though they still
tend to overestimate ranks at smaller test sizes, whereas BNMFk predicts 𝑘 = 4 or lower. RMSE results confirm Boolean
settings improve link prediction, especially for WNMFkkmeans and WNMFksearch . BNMFk remains the best for rank
prediction and lower RMSE, except with Otsu thresholding, which yields higher RMSE values. Across all methods,

Fig. 5. Results for Dog Data across methods, including BNMFk, NMFk, and WNMFk, evaluated under different Boolean thresholding
techniques. The Boolean thresholding techniques are denoted with the subscripts of kmeans, otsu, and search (coordinate descent).
The first row presents violin plots visualizing the rank 𝑘 predictions at different test-set sizes. The second row displays RMSE scores
for the test set, demonstrating the missing link prediction performance. The results are reported for each rank 𝑘 on the x-axis, with
the dark, dashed vertical line across each column indicating the true rank 𝑘 = 4.


Fig. 6. Results for Dog Data across methods, including BNMFk, NMFk, and WNMFk, evaluated under different Boolean thresholding
techniques. The Boolean thresholding techniques are denoted with the subscripts of kmeans, otsu, and search (coordinate descent).
The first row presents the change in RMSE after making predictions on the non-abstained samples, where a negative change indicates
improved performance (lower RMSE) for the given fraction of abstained samples. The second row illustrates the fraction of abstained
predictions, representing the proportion of cases where the model opted not to make a prediction due to uncertainty. The results are
reported for each rank 𝑘 on the x-axis, with the dark and dashed vertical line across each column indicating the true rank 𝑘 = 4.

a trend of improved RMSE scores at or around the true rank can be observed in this figure, further highlighting the
importance of rank selection.
Figure 6 highlights UQ’s role in improving predictions. The first row shows the RMSE change after applying UQ,
where negative values indicate improved performance (reduced RMSE value) by abstaining from uncertain predictions.

Fig. 7. Computation times for each method are shown in the log scale on the Dog Data. The results indicate that methods leveraging
coordinate descent (search) for Boolean thresholding require significantly longer computation times than other approaches.


Fig. 8. The performance of each method on the Swimmer Data is shown. The first column features a violin plot displaying the
distribution of 𝑘 predictions for each method, with the y-axis representing the methods and the x-axis representing the predicted 𝑘
values. The second through final columns present results at each 𝑘 decomposition value, with the x-axis corresponding to the rank 𝑘.
The dark dashed vertical lines indicate this dataset's true rank of 𝑘 = 16. The Boolean thresholding techniques are denoted with the
subscripts kmeans, otsu, and search (coordinate descent), while the uniform subscript refers to running BNMFk without Boolean
thresholding on the latent factors. NMFk and WNMFk without subscripts denote results obtained without Boolean techniques for these
methods.

The second row presents the fraction of abstained predictions at each test-set size, with higher fractions indicating lower
coverage. UQ effectively filters uncertain predictions, lowering RMSE in several cases compared to Figure 5 (without UQ).
Methods like BNMFksearch and WNMFkkmeans show notable RMSE reduction. Abstention rates generally increase
with test-set size, reflecting greater uncertainty in sparse data, until around a test-set size of 0.8. Beyond this point, the
training set becomes too small, resulting in no meaningful abstentions. However, at the true rank 𝑘 = 4, abstention
rates decline for several methods; combined with the improved performance at the true rank observed in Figure 5, this
indicates higher confidence in predictions at the correct rank.
Figure 7 shows the computational time for each method in seconds. While coordinate descent-based Boolean
thresholding yields better results (Figure 5), this improvement comes at significantly higher computational cost: methods
using the search setting have substantially longer run times. Given this, we exclude the search
method from subsequent experiments.

6.2 Swimmer Data


We analyze the Swimmer dataset, with results in Figure 8. The test-set size is fixed at 0.1 (10%) without sparsity analysis.
Figure 8 shows 𝑘-predictions (first column), RMSE scores (second), RMSE with abstention (third), and the fraction of
abstained samples (fourth). The true rank 𝑘 = 16 is marked by a dark dashed vertical line, with the x-axis representing
𝑘 values. Each method is color-coded, while line styles distinguish Boolean settings in columns 2-4. Figure 8 shows
that BNMFk with Boolean settings consistently identified 𝑘 = 16, as did WNMFk without Boolean settings. WNMFk
with k-means thresholding predicted both 𝑘 = 16 and 𝑘 = 17, while BNMFkuniform (without Boolean settings) always
predicted 𝑘 = 1. Other methods produced 𝑘 values between 𝑘 = 1 and 𝑘 = 17. Some non-Boolean methods predicted
𝑘 = 16 because the Swimmer dataset’s non-negative and Boolean ranks are equal.
In the second column of Figure 8, NMFkkmeans , WNMFkotsu , WNMFkkmeans , and BNMFkkmeans improve as 𝑘
nears 16, after which RMSE increases. Notably, BNMFkkmeans continues improving until 𝑘 = 18 before RMSE rises,
mirroring Dog Data trends. Non-Boolean decompositions maintain higher RMSE (lower performance) across all 𝑘,
confirming Boolean methods perform better on Boolean data. The third column shows UQ significantly improves RMSE,
even for non-Boolean methods, though less so for BNMFkuniform . The RMSE remains stable across ranks as models
abstain from uncertain predictions. However, as a trade-off, non-Boolean methods like WNMFk and NMFk exhibit a
higher reduction in coverage, shown by a greater fraction of abstained samples in the last column.

6.3 Gaussian Data


Our final synthetic dataset, Gaussian Data, is analyzed in Figure 9 for WNMFk and RNMFk, assessing their rank
prediction at different test-set sizes. No Boolean settings are used. Color and line style differentiate true ranks; for
example, an orange line with 𝑥 markers represents 𝑘 = 3. WNMFk predicts the correct rank until a test-set size of 0.7,
after which predictions skew toward 𝑘 = 1. RNMFk follows a similar trend but loses rank prediction accuracy earlier at
higher true ranks. For instance, at 𝑘 = 6 (purple line with diamond markers), lower rank predictions appear after a
test-set size of 0.5.

Fig. 9. Results for 𝑘 predictions on Gaussian Data are shown for WNMFk and RNMFk methods across different data sparsity levels,
represented by the increasing test-set size on the x-axis. The y-axis indicates the predicted 𝑘 values. The results demonstrate that 𝑘
predictions remain accurate for both methods until high levels of sparsity are reached, while WNMFk results in more accurate 𝑘
predictions. Line color and style differentiate between different true 𝑘 values. For instance, a solid line with circle markers represents
decompositions where the matrix rank was 𝑘 = 2.
We examine link prediction performance using RMSE in log scale for test-set sizes 0.1 and 0.5 in Figure 10, where
blue lines represent 0.5 and red lines 0.1. Shapes differentiate true ranks; for example, square markers indicate 𝑘 = 4,
with blue squares for 𝑘 = 4 at 0.5. For WNMFk, RMSE increases beyond the true rank, reflecting reduced performance,
with the best results at the correct rank. This trend holds at higher sparsity (0.5 test-set size) but is less pronounced due
to overall performance decline. RNMFk follows a similar pattern but shows more stable link prediction, as indicated by
flatter RMSE lines, suggesting more consistent performance across ranks.

Fig. 10. Link prediction performance on test-set sizes of 10% (red lines) and 50% (blue lines) is shown for WNMFk and RNMFk, with
RMSE scores plotted on a log scale. The x-axis represents the 𝑘 values, displaying results at different 𝑘 decompositions. Line markers
are used to differentiate between the true 𝑘 values. For example, solid red lines with circle markers represent matrices with a true
rank of 𝑘 = 2 and a test-set size of 10%.

Fig. 12. Relative-Std-to-Mean Ratio of Uncertainty plot provides a quantitative perspective on the behavior of uncertainty
estimates across varying sparsity levels (or test set size) for each 𝑘 true. The top figure shows the standard deviation (std) at test points
divided by the overall average of certainty (UQ). The bottom figure investigates certainty saturation by normalizing the std-to-mean
uncertainty ratio at the top plot with the correlation between the UQ and errors at test points.

For Gaussian Data, we analyze the correlation between UQ and errors in Figure 11 to assess whether models assign
higher uncertainty to higher-error points. These trends align with the broader results in Appendix A.2, Figure 14, which
examines test-set sizes (0.1-0.9) and true ranks (𝑘 = 2-6). Figure 11 shows Pearson Correlation Coefficients and CI
between UQ and relative reconstruction errors (Equation 4). Columns represent test-set sizes, rows correspond to true
ranks 𝑘 = 2 and 𝑘 = 3, and the y-axis denotes correlation coefficients. Hue differentiates WNMFk and RNMFk, with
color bars indicating decomposition results for each 𝑘.

Fig. 11. Pearson Correlation Coefficients with CI between UQ and errors at the test set are reported across different sparsity levels or
test-set sizes that include 0.1, 0.5, and 0.9 (columns) and true matrix ranks 𝑘 ∈ {2, 3} (rows). The shared x-axis differentiates between
the methods WNMFk and RNMFk using hues, while the shared y-axis represents the Pearson correlation values. Bars with different
colors correspond to results at various 𝑘 decompositions.
Figure 11 shows that correlations between certainty and error remain high for RNMFk and WNMFk, confirming that
UQ effectively assigns higher uncertainty to high-error points. Correlation is lower before the true rank and peaks at
the correct rank, especially for WNMFk and smaller test-set sizes, indicating reliable certainty estimates. In the second
row (𝑘 = 3), the orange bars for WNMFk are notably higher, highlighting improved performance at this rank. Another
key trend in Figure 11 is the reduction in correlation between certainty and error at the largest test-set size of 0.9.
Our hypothesis for this behavior is that at higher sparsity levels, where the training set is significantly reduced, the
uncertainty estimates may saturate at consistently high values across all test points, thereby reducing variability. We
have also observed a related result in Figure 6, where the fraction of abstained samples decreased at higher
test-set sizes. When uncertainty does not vary meaningfully, its correlation with actual errors naturally diminishes. This
saturation reflects a breakdown in the model’s ability to provide meaningful uncertainty estimates, signaling that the
model is operating in a regime of extreme sparsity. Under such conditions, there are significantly more missing links to
predict while relying on minimal information from the training set. As a result, the model’s uncertainty quantification
methods become less reliable.
The "Relative-Std-to-Mean Ratio of Uncertainty" plot, shown in the top row of Figure 12, provides a quantitative
perspective on the behavior of uncertainty estimates across varying levels of sparsity (or test-set size). This metric is
calculated for the test locations based on the uncertainty matrix U and is defined as:
SMR = 𝜎U / 𝜇U,    (51)
where SMR is the standard deviation (std)-to-mean ratio, 𝜎U is the standard deviation of UQ values at test locations,
and 𝜇 U is the mean of UQ values at test locations, calculated as:
𝜎U = √( (1/|Itest|) ∑(𝑖,𝑗) ∈ Itest (U𝑖𝑗 − Ū)² ),    𝜇U = (1/|Itest|) ∑(𝑖,𝑗) ∈ Itest U𝑖𝑗,    (52)
where Itest is the set of test locations, and Ū is the mean uncertainty value at the test locations. This metric tracks
the relationship between the variability of uncertainty (measured as the standard deviation of uncertainty) and the
average uncertainty (mean of UQ) at the test points. By normalizing variability through its mean, the std-to-mean ratio
captures whether the relative dispersion in uncertainty estimates decreases significantly as sparsity increases. In the
top row of Figure 12, the std-to-mean ratio drops until around a 0.5 test-set size and then stabilizes. Meanwhile, the
correlation between errors and UQ remains high until a test-set size of 0.7 or 0.8, as shown in the Appendix Figure
14. The drop in the std-to-mean ratio suggests that the variability of uncertainty at test locations is becoming more
consistent relative to its mean but does not necessarily indicate saturation (since the correlation remains high). The
normalized std-to-mean ratio plot, shown in the second row of Figure 12, refines this measure by incorporating the
correlation between UQ and errors. This metric is defined as:
NSMR = (SMR · 𝑟) / max(𝑟),    (53)
where 𝑟 is the vector of correlation values between UQ and reconstruction error. This normalization adjusts the standard
deviation-to-mean ratio (SMR) by weighting it with the correlation between UQ and error, scaled by the maximum
correlation value observed. This scaling emphasizes variability in uncertainty estimates that predict error rather
than those that fluctuate without meaningful correlation to errors. By incorporating the correlation term, this metric
provides a refined measure of how uncertainty dispersion (std-to-mean ratio) relates to actual predictive reliability.
Their significance diminishes if uncertainty estimates are highly variable but poorly correlated with errors. Conversely,
when variability in uncertainty remains well-correlated with errors, it suggests that the uncertainty estimates are
informative.
In the second row of Figure 12, higher values, observed until around a test-set size of 0.7, indicate that the uncertainty
estimates remain variable and meaningfully related to errors. This suggests that UQ values are informative about
prediction confidence. The observed decline beyond a test-set size of 0.8, where both the std-to-mean ratio in the first
row of Figure 12 and the correlation in Figure 11 are low, suggests that uncertainty estimates are saturating. This
saturation implies that the model’s certainty values no longer distinguish between correct and incorrect predictions,
reducing their usefulness at higher sparsity levels.
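As a sketch, assuming U and M are NumPy arrays and the correlation vector 𝑟 has been computed separately (the helper name and signature are ours):

```python
import numpy as np

def smr_nsmr(U, M, r):
    """Sketch of Eqs. (51)-(53). `r` is the vector of Pearson correlations
    between UQ and error across settings."""
    u_test = U[M == 0]                  # UQ values at test locations
    smr = u_test.std() / u_test.mean()  # Eq. (51)
    r = np.asarray(r, dtype=float)
    nsmr = smr * r / r.max()            # Eq. (53)
    return smr, nsmr
```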

6.4 PPI Data


We next benchmark our methods against LMF and symLMF [60]. Table 3 presents the benchmark results on the PPI
networks: H. sapiens-extended, Brain, Disease of Metabolism, Liver, and Neurodegenerative Disease. For BNMFk, we
employ k-means clustering for Boolean thresholding, as described in Section 3.10. At the same time, RNMFk and
WNMFk are evaluated without Boolean settings, following their original implementations as described in Sections 3.4
and 3.5, respectively. Additionally, we include the LMF-extended results for WNMFklmf, BNMFklmf, and RNMFklmf, as described in Section 3.7.

Dataset/Metric            BNMFk            RNMFk            WNMFk            BNMFklmf         RNMFklmf         WNMFklmf         LMF              symLMF [60]
H.sapiens-extended
ROC AUC 0.755 ± 0.022 0.941 ± 0.001 0.951 ± 0.003 0.959 ± 0.002 0.955 ± 0.001 0.955 ± 0.001 0.949 ± 0.001 0.944 ± 0.001
PR AUC 0.884 ± 0.009 0.957 ± 0.001 0.964 ± 0.003 0.969 ± 0.002 0.965 ± 0.001 0.965 ± 0.001 0.934 ± 0.001 0.955 ± 0.001
UQ - ROC AUC 0.762 ± 0.027 0.942 ± 0.001 0.954 ± 0.003 0.962 ± 0.004 0.956 ± 0.001 0.957 ± 0.001 – –
UQ - PR AUC 0.880 ± 0.010 0.958 ± 0.001 0.966 ± 0.002 0.969 ± 0.004 0.966 ± 0.001 0.967 ± 0.001 – –
Brain
ROC AUC 0.721 ± 0.011 0.943 ± 0.001 0.941 ± 0.001 0.955 ± 0.001 0.954 ± 0.001 0.953 ± 0.001 0.931 ± 0.001 0.952 ± 0.001
PR AUC 0.849 ± 0.004 0.953 ± 0.001 0.948 ± 0.001 0.962 ± 0.001 0.960 ± 0.001 0.959 ± 0.002 0.921 ± 0.001 0.957 ± 0.001
UQ - ROC AUC 0.727 ± 0.010 0.944 ± 0.001 0.945 ± 0.002 0.960 ± 0.001 0.955 ± 0.001 0.956 ± 0.002 – –
UQ - PR AUC 0.840 ± 0.004 0.953 ± 0.001 0.952 ± 0.002 0.960 ± 0.002 0.961 ± 0.001 0.962 ± 0.002 – –
Disease of Metabolism
ROC AUC 0.771 ± 0.025 0.958 ± 0.006 0.959 ± 0.003 0.969 ± 0.007 0.966 ± 0.008 0.966 ± 0.007 0.968 ± 0.002 0.911 ± 0.006
PR AUC 0.861 ± 0.014 0.958 ± 0.007 0.957 ± 0.003 0.968 ± 0.008 0.963 ± 0.009 0.963 ± 0.008 0.943 ± 0.003 0.926 ± 0.005
UQ - ROC AUC 0.785 ± 0.024 0.961 ± 0.007 0.962 ± 0.003 0.974 ± 0.006 0.968 ± 0.008 0.967 ± 0.007 – –
UQ - PR AUC 0.863 ± 0.017 0.961 ± 0.009 0.957 ± 0.003 0.969 ± 0.006 0.965 ± 0.009 0.962 ± 0.008 – –
Liver
ROC AUC 0.712 ± 0.007 0.944 ± 0.001 0.941 ± 0.001 0.956 ± 0.001 0.954 ± 0.001 0.953 ± 0.001 0.933 ± 0.001 0.952 ± 0.0004
PR AUC 0.844 ± 0.004 0.954 ± 0.001 0.948 ± 0.001 0.962 ± 0.001 0.961 ± 0.001 0.960 ± 0.001 0.925 ± 0.001 0.958 ± 0.0004
UQ - ROC AUC 0.719 ± 0.009 0.945 ± 0.001 0.945 ± 0.002 0.961 ± 0.002 0.955 ± 0.001 0.957 ± 0.002 – –
UQ - PR AUC 0.833 ± 0.003 0.955 ± 0.001 0.951 ± 0.002 0.960 ± 0.002 0.962 ± 0.001 0.963 ± 0.001 – –
Neurodegenerative Disease
ROC AUC 0.765 ± 0.035 0.968 ± 0.006 0.957 ± 0.005 0.976 ± 0.003 0.973 ± 0.002 0.974 ± 0.002 0.974 ± 0.002 0.941 ± 0.005
PR AUC 0.873 ± 0.015 0.973 ± 0.006 0.961 ± 0.005 0.981 ± 0.003 0.978 ± 0.002 0.979 ± 0.002 0.963 ± 0.002 0.952 ± 0.004
UQ - ROC AUC 0.769 ± 0.029 0.970 ± 0.006 0.959 ± 0.005 0.979 ± 0.003 0.975 ± 0.002 0.975 ± 0.001 – –
UQ - PR AUC 0.859 ± 0.016 0.975 ± 0.006 0.962 ± 0.006 0.978 ± 0.003 0.979 ± 0.002 0.979 ± 0.002 – –
Table 3. The performance of BNMFk, RNMFk, and WNMFk on five PPI datasets is reported and compared against their LMF-based
extensions, indicated by the subscript lmf, i.e. WNMFklmf , BNMFklmf , and RNMFklmf . These results are also benchmarked against
the LMF and symLMF methods. Average results are reported with ± two standard deviation. Performance metrics include ROC AUC
and PR AUC, as well as ROC AUC and PR AUC with UQ-based weighting. Bolded scores indicate the best-performing methods
for ROC AUC and PR AUC metrics across all approaches. For UQ-based scores, underlined values highlight instances where the
inclusion of UQ-based weighting improved the scores compared to their non-UQ-weighted counterparts. Scores for symLMF are
directly taken from [60]. Here, we run BNMFk with kmeans clustering for Boolean thresholding and RNMFk and WNMFk without
Boolean thresholding. Reported LMF scores correspond to the LMF decomposition at various rank 𝑘 values that are predicted across
the rank-𝑘 estimations of BNMFk, RNMFk, and WNMFk for all cross-validation runs on the given PPI datasets.

Performance metrics include ROC AUC, PR AUC, and UQ-based scores, detailed in Section 5.1.
Table 3 shows that BNMFk has the lowest performance, while RNMFk and WNMFk are consistently competitive with LMF
and symLMF across datasets. For example, on H.sapiens-extended, RNMFk achieves an ROC AUC of 0.941, compared
to 0.949 for LMF and 0.944 for symLMF, while attaining a higher PR AUC of 0.957 (vs. 0.934 for LMF and 0.955 for symLMF).
Ensemble models yield higher performance, with BNMFklmf achieving the highest scores across datasets, showing
that combining models with LMF improves performance over standalone versions. On the Brain PPI dataset, BNMFk
attains a PR AUC of 0.849, RNMFk 0.953, and WNMFk 0.948, while their LMF extensions reach 0.962, 0.960, and
0.959, respectively. These scores surpass LMF alone, indicating complementary model strengths under the ensemble
framework. UQ-based weighting slightly improves scores in some cases, as shown by underlined ROC AUC and PR
AUC values, though the gains are small for these datasets. These results on five real-world PPI networks validate the
efficacy of our methods beyond synthetic datasets. The next section summarizes the technical capabilities and software
development considerations behind our Python library.

7 T-ELF: Python Library


We have publicly released all methods introduced in this paper through a user-friendly Python library, T-ELF [21], available at https://github.com/lanl/T-ELF.
The library supports multi-processing via threading for concurrent operations, allowing each 𝑘 decomposition in
the automatic model determination process to run in separate processes or GPUs. T-ELF uses Numpy [28] for CPU
operations and CuPy with CuPyx [57] for GPU-based dense and sparse matrix computations. HPC capabilities are
enabled through OpenMPI, managed via mpi4py [16, 17], allowing further parallelization across HPC compute nodes
for scalability. T-ELF is an installable Python package available via pip [1] or Poetry, with example Jupyter Notebooks
[40] and unit tests integrated into GitHub’s CI pipeline. It includes comprehensive documentation, hosted via Sphinx
and GitHub Pages [63], with installation guides for environments like Conda [2]. To make the library easy to use, T-ELF
follows an API style inspired by Scikit-learn [59].
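For illustration, a hypothetical usage sketch in the spirit of that API follows; the import path, class name, and parameter names below are assumptions to be checked against the library's documentation, not the authoritative interface.

```python
# Hypothetical usage sketch of T-ELF's automatic model determination.
# The import path, class name, and parameter names are assumptions made
# for illustration; consult https://github.com/lanl/T-ELF for the actual API.
import numpy as np
from TELF.factorization import NMFk  # assumed import path

X = np.abs(np.random.default_rng(0).normal(size=(400, 16)))  # toy data

model = NMFk(
    n_perturbs=20,  # assumed: number of bootstrap perturbations per rank
    n_iters=1000,   # assumed: update iterations per decomposition
    n_jobs=4,       # assumed: parallel workers, one rank k per worker
    verbose=True,
)
results = model.fit(X, Ks=range(1, 11))  # assumed: candidate ranks k = 1..10
```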

8 Conclusion
In this paper, we introduced novel matrix factorization extensions for link prediction: Weighted (WNMFk), Boolean
(BNMFk), and Recommender (RNMFk), along with ensemble variants using logistic factorization. These methods address
traditional limitations by integrating automatic model determination and UQ. Our framework heuristically estimates
the optimal rank 𝑘 via a modified bootstrap-based stability and accuracy analysis, while UQ enhances reliability by
abstaining from uncertain predictions. In addition to coordinate descent-based thresholding, we introduced new Boolean
thresholding techniques for Boolean Matrix Factorization based on Otsu's method and k-means clustering. Experiments
on three synthetic datasets validated rank selection, assessed data sparsity effects, and tested link prediction performance
in Boolean and non-Boolean settings. To demonstrate practical relevance, we benchmarked our methods against LMF and
symLMF on five real-world PPI networks, demonstrating improved predictions. Finally, we released our methods as a
user-friendly Python library on GitHub, supporting multi-processing, GPU acceleration, and HPC environments for
large-scale applications.

Acknowledgments
This manuscript has been approved for unlimited release and has been assigned LA-UR-25-22115. This work was
funded by a grant HDTRA1242032(CB11198) of BSA, from the Defense Threat Reduction Agency (DTRA) of the U.S.
Department of Defense (DoD). The funders had no role in study design, data collection and interpretation, or the
decision to submit the work for publication. Funds for the demonstration and/or assessment work were provided by the
Los Alamos National Laboratory Technology Evaluation & Demonstration program. The research was also supported
by LANL Institutional Computing Program, and by the U.S. DOE NNSA under Contract No. 89233218CNA000001.

References
[1] [n. d.]. Python Package Index - PyPI. https://pypi.org/
[2] 2024. Anaconda Software Distribution. https://docs.anaconda.com/
[3] Mohiuddin Ahmed, Abdun Naser Mahmood, and Jiankun Hu. 2016. A survey of network anomaly detection techniques. Journal of Network and
Computer Applications 60 (2016), 19–31. https://doi.org/10.1016/j.jnca.2015.11.016
[4] Boian Alexandrov, Velimir Vesselinov, and Kim Orskov Rasmussen. 2021. SmartTensors Unsupervised AI Platform for Big-Data Analytics. Technical
Report. Los Alamos National Lab.(LANL), Los Alamos, NM (United States). https://www.lanl.gov/collaboration/smart-tensors/ LA-UR-21-25064.
[5] Boian S Alexandrov, Ludmil B Alexandrov, Filip L Iliev, Valentin G Stanev, and Velimir V Vesselinov. 2020. Source identification by non-negative
matrix factorization combined with semi-supervised clustering. US Patent 10,776,718.


[6] Ludmil B. Alexandrov, Jaegil Kim, Nicholas J. Haradhvala, Mi Ni Huang, Alvin Wei Tian Ng, Yang Wu, Arnoud Boot, Kyle R. Covington, Dmitry A.
Gordenin, Erik N. Bergstrom, S. M. Ashiqul Islam, Nuria Lopez-Bigas, Leszek J. Klimczak, John R. McPherson, Sandro Morganella, Radhakrishnan
Sabarinathan, David A. Wheeler, Ville Mustonen, Paul Boutros, Kin Chan, Akihiro Fujimoto, Gad Getz, Marat Kazanov, Michael Lawrence, Iñigo
Martincorena, Hidewaki Nakagawa, Paz Polak, Stephenie Prokopec, Steven A. Roberts, Steven G. Rozen, Natalie Saini, Tatsuhiro Shibata, Yuichi
Shiraishi, Michael R. Stratton, Bin Tean Teh, Ignacio Vázquez-García, Fouad Yousif, Willie Yu, Lauri A. Aaltonen, Federico Abascal, Adam Abeshouse,
Hiroyuki Aburatani, David J. Adams, Nishant Agrawal, Keun Soo Ahn, Sung-Min Ahn, Hiroshi Aikata, Rehan Akbani, Kadir C. Akdemir, Hikmat
Al-Ahmadie, Sultan T. Al-Sedairy, Fatima Al-Shahrour, Malik Alawi, Monique Albert, Kenneth Aldape, Adrian Ally, Kathryn Alsop, Eva G. Alvarez,
Fernanda Amary, Samirkumar B. Amin, Brice Aminou, Ole Ammerpohl, Matthew J. Anderson, Yeng Ang, Davide Antonello, Pavana Anur, Samuel
Aparicio, Elizabeth L. Appelbaum, Yasuhito Arai, Axel Aretz, Koji Arihiro, Shun-ichi Ariizumi, Joshua Armenia, Laurent Arnould, Sylvia Asa, Yassen
Assenov, Gurnit Atwal, Sietse Aukema, J. Todd Auman, Miriam R. R. Aure, Philip Awadalla, Marta Aymerich, Gary D. Bader, Adrian Baez-Ortega,
Matthew H. Bailey, Peter J. Bailey, Miruna Balasundaram, Saianand Balu, Pratiti Bandopadhayay, Rosamonde E. Banks, Stefano Barbi, Andrew P.
Barbour, Jonathan Barenboim, Jill Barnholtz-Sloan, Hugh Barr, Elisabet Barrera, John Bartlett, Javier Bartolome, Claudio Bassi, Oliver F. Bathe, Daniel
Baumhoer, Prashant Bavi, Stephen B. Baylin, Wojciech Bazant, Duncan Beardsmore, Timothy A. Beck, Sam Behjati, Andreas Behren, Beifang Niu,
Cindy Bell, Sergi Beltran, Christopher Benz, Andrew Berchuck, Anke K. Bergmann, Benjamin P. Berman, Daniel M. Berney, Stephan H. Bernhart,
Rameen Beroukhim, Mario Berrios, Samantha Bersani, Johanna Bertl, Miguel Betancourt, Vinayak Bhandari, Shriram G. Bhosle, Andrew V. Biankin,
Matthias Bieg, Darell Bigner, Hans Binder, Ewan Birney, Michael Birrer, Nidhan K. Biswas, Bodil Bjerkehagen, Tom Bodenheimer, Lori Boice,
Giada Bonizzato, Johann S. De Bono, Moiz S. Bootwalla, Ake Borg, Arndt Borkhardt, Keith A. Boroevich, Ivan Borozan, Christoph Borst, Marcus
Bosenberg, Mattia Bosio, Jacqueline Boultwood, Guillaume Bourque, Paul C. Boutros, G. Steven Bova, David T. Bowen, Reanne Bowlby, David
D. L. Bowtell, Sandrine Boyault, Rich Boyce, Jeffrey Boyd, Alvis Brazma, Paul Brennan, Daniel S. Brewer, Arie B. Brinkman, Robert G. Bristow,
Russell R. Broaddus, Jane E. Brock, Malcolm Brock, Annegien Broeks, Angela N. Brooks, Denise Brooks, Benedikt Brors, Søren Brunak, Timothy
J. C. Bruxner, Alicia L. Bruzos, Alex Buchanan, Ivo Buchhalter, Christiane Buchholz, Susan Bullman, Hazel Burke, Birgit Burkhardt, Kathleen H.
Burns, John Busanovich, Carlos D. Bustamante, Adam P. Butler, Atul J. Butte, Niall J. Byrne, Anne-Lise Børresen-Dale, Samantha J. Caesar-Johnson,
Andy Cafferkey, Declan Cahill, Claudia Calabrese, Carlos Caldas, Fabien Calvo, Niedzica Camacho, Peter J. Campbell, Elias Campo, Cinzia Cantù,
Shaolong Cao, Thomas E. Carey, Joana Carlevaro-Fita, Rebecca Carlsen, Ivana Cataldo, Mario Cazzola, Jonathan Cebon, Robert Cerfolio, Dianne E.
Chadwick, Dimple Chakravarty, Don Chalmers, Calvin Wing Yiu Chan, Michelle Chan-Seng-Yue, Vishal S. Chandan, David K. Chang, Stephen J.
Chanock, Lorraine A. Chantrill, Aurélien Chateigner, Nilanjan Chatterjee, Kazuaki Chayama, Hsiao-Wei Chen, Jieming Chen, Ken Chen, Yiwen
Chen, Zhaohong Chen, Andrew D. Cherniack, Jeremy Chien, Yoke-Eng Chiew, Suet-Feung Chin, Juok Cho, Sunghoon Cho, Jung Kyoon Choi, Wan
Choi, Christine Chomienne, Zechen Chong, Su Pin Choo, Angela Chou, Angelika N. Christ, Elizabeth L. Christie, Eric Chuah, Carrie Cibulskis,
Kristian Cibulskis, Sara Cingarlini, Peter Clapham, Alexander Claviez, Sean Cleary, Nicole Cloonan, Marek Cmero, Colin C. Collins, Ashton A.
Connor, Susanna L. Cooke, Colin S. Cooper, Leslie Cope, Vincenzo Corbo, Matthew G. Cordes, Stephen M. Cordner, Isidro Cortés-Ciriano, Kyle
Covington, Prue A. Cowin, Brian Craft, David Craft, Chad J. Creighton, Yupeng Cun, Erin Curley, Ioana Cutcutache, Karolina Czajka, Bogdan
Czerniak, Rebecca A. Dagg, Ludmila Danilova, Maria Vittoria Davi, Natalie R. Davidson, Helen Davies, Ian J. Davis, Brandi N. Davis-Dusenbery,
Kevin J. Dawson, Francisco M. De La Vega, Ricardo De Paoli-Iseppi, Timothy Defreitas, Angelo P. Dei Tos, Olivier Delaneau, John A. Demchok,
PCAWG Mutational Signatures Working Group, and P. C. A. W. G. Consortium. 2020. The repertoire of mutational signatures in human cancer.
Nature 578, 7793 (01 Feb 2020), 94–101. https://doi.org/10.1038/s41586-020-1943-3
[7] Ludmil B. Alexandrov, Serena Nik-Zainal, David C. Wedge, Samuel A. J. R. Aparicio, Sam Behjati, Andrew V. Biankin, Graham R. Bignell, Niccolò
Bolli, Ake Borg, Anne-Lise Børresen-Dale, Sandrine Boyault, Birgit Burkhardt, Adam P. Butler, Carlos Caldas, Helen R. Davies, Christine Desmedt,
Roland Eils, Jórunn Erla Eyfjörd, John A. Foekens, Mel Greaves, Fumie Hosoda, Barbara Hutter, Tomislav Ilicic, Sandrine Imbeaud, Marcin Imielinski,
Natalie Jäger, David T. W. Jones, David Jones, Stian Knappskog, Marcel Kool, Sunil R. Lakhani, Carlos López-Otín, Sancha Martin, Nikhil C. Munshi,
Hiromi Nakamura, Paul A. Northcott, Marina Pajic, Elli Papaemmanuil, Angelo Paradiso, John V. Pearson, Xose S. Puente, Keiran Raine, Manasa
Ramakrishna, Andrea L. Richardson, Julia Richter, Philip Rosenstiel, Matthias Schlesner, Ton N. Schumacher, Paul N. Span, Jon W. Teague, Yasushi
Totoki, Andrew N. J. Tutt, Rafael Valdés-Mas, Marit M. van Buuren, Laura van ’t Veer, Anne Vincent-Salomon, Nicola Waddell, Lucy R. Yates, Jessica
Zucman-Rossi, P. Andrew Futreal, Ultan McDermott, Peter Lichter, Matthew Meyerson, Sean M. Grimmond, Reiner Siebert, Elías Campo, Tatsuhiro
Shibata, Stefan M. Pfister, Peter J. Campbell, Michael R. Stratton, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium,
ICGC MMML-Seq Consortium, and I. C. G. C. PedBrain. 2013. Signatures of mutational processes in human cancer. Nature 500, 7463 (01 Aug 2013),
415–421. https://doi.org/10.1038/nature12477
[8] Ludmil B Alexandrov, Serena Nik-Zainal, David C Wedge, Peter J Campbell, and Michael R Stratton. 2013. Deciphering signatures of mutational
processes operative in human cancer. Cell reports 3, 1 (2013), 246–259.
[9] Norah Saleh Alghamdi, Fatma Taher, Heba Kandil, Ahmed Sharafeldeen, Ahmed Elnakib, Ahmed Soliman, Yaser ElNakieb, Ali Mahmoud, Mohammed
Ghazal, and Ayman El-Baz. 2022. Segmentation of Infant Brain Using Nonnegative Matrix Factorization. Applied Sciences 12, 11 (2022). https:
//doi.org/10.3390/app12115377
[10] Yuval Bahat and Gregory Shakhnarovich. 2018. Confidence from Invariance to Image Transformations. ArXiv abs/1804.00657 (2018). https:
//api.semanticscholar.org/CorpusID:4562935
[11] Christopher M Bishop. 1999. Bayesian pca. Advances in neural information processing systems (1999), 382–388.
[12] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. 2004. Metagenes and molecular pattern discovery using matrix factorization.
Proceedings of the national academy of sciences 101, 12 (2004), 4164–4169.

[13] Yun Cai, Hong Gu, and Toby Kenney. 2023. Rank selection for non-negative matrix factorization. Statistics in Medicine 42, 30 (2023), 5676–5693.
https://doi.org/10.1002/sim.9934 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9934
[14] Jiahang Cao, Jinyuan Fang, Zaiqiao Meng, and Shangsong Liang. 2024. Knowledge graph embedding: A survey from the perspective of representation
spaces. Comput. Surveys 56, 6 (2024), 1–42.
[15] Guangfu Chen, Chen Xu, Jingyi Wang, Jianwen Feng, and Jiqiang Feng. 2020. Robust non-negative matrix factorization for link prediction in
complex networks using manifold regularization and sparse learning. Physica A: Statistical Mechanics and its Applications 539 (2020), 122882.
[16] Lisandro Dalcin and Yao-Lung L. Fang. 2021. mpi4py: Status Update After 12 Years of Development. Computing in Science & Engineering 23, 4 (2021),
47–54. https://doi.org/10.1109/MCSE.2021.3083216
[17] Lisandro D. Dalcin, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. 2011. Parallel distributed computing using Python. Advances in Water
Resources 34, 9 (2011), 1124–1139. https://doi.org/10.1016/j.advwatres.2011.04.013 New Computational Methods and Software Tools.
[18] Derek DeSantis, Erik Skau, Duc P Truong, and Boian Alexandrov. 2022. Factorization of binary matrices: Rank relations, uniqueness and model
selection of boolean decomposition. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 6 (2022), 1–24.
[19] Derek Desantis, Erik Skau, Duc P. Truong, and Boian Alexandrov. 2022. Factorization of Binary Matrices: Rank Relations, Uniqueness and Model
Selection of Boolean Decomposition. ACM Trans. Knowl. Discov. Data 16, 6, Article 112 (July 2022), 24 pages. https://doi.org/10.1145/3522594
[20] Liang Duan, Shuai Ma, Charu Aggarwal, Tiejun Ma, and Jinpeng Huai. 2017. An Ensemble Approach to Link Prediction. IEEE Transactions on
Knowledge and Data Engineering 29, 11 (2017), 2402–2416. https://doi.org/10.1109/TKDE.2017.2730207
[21] Maksim Eren, Nick Solovyev, Ryan Barron, Manish Bhattarai, Duc Truong, Ismael Boureima, Erik Skau, Kim Ø. Rasmussen, and Boian Alexandrov.
2023. Tensor Extraction of Latent Features (T-ELF). https://doi.org/10.5281/zenodo.10257897
[22] Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, and Boian S. Alexandrov. 2023. Semi-Supervised Classification of
Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection. ACM Trans.
Priv. Secur. 26, 4, Article 48 (Nov. 2023), 27 pages. https://doi.org/10.1145/3624567
[23] Maksim E. Eren, Manish Bhattarai, Nicholas Solovyev, Luke E. Richards, Roberto Yus, Charles Nicholas, and Boian S. Alexandrov. 2022. One-Shot
Federated Group Collaborative Filtering. In 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA). 647–652.
https://doi.org/10.1109/ICMLA55696.2022.00107
[24] Maksim E. Eren, Juston S. Moore, Erik Skau, Elisabeth Moore, Manish Bhattarai, Gopinath Chennupati, and Boian S. Alexandrov. 2023. General-
purpose Unsupervised Cyber Anomaly Detection via Non-negative Tensor Factorization. Digital Threats 4, 1, Article 6 (March 2023), 28 pages.
https://doi.org/10.1145/3519602
[25] Maksim E Eren, Luke E Richards, Manish Bhattarai, Roberto Yus, Charles Nicholas, and Boian S Alexandrov. 2022. FedSPLIT: One-Shot Federated
Recommendation System Based on Non-negative Joint Matrix Factorization and Knowledge Distillation. arXiv preprint arXiv:2205.02359 (2022).
https://arxiv.org/abs/2205.02359
[26] Cédric Févotte and A Taylan Cemgil. 2009. Nonnegative matrix factorizations as probabilistic inference in composite models. In 17th European
Signal Processing Conference. 1913–1917.
[27] M. Girvan and M. E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99,
12 (2002), 7821–7826. https://doi.org/10.1073/pnas.122653799 arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.122653799
[28] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian
Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río,
Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and
Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362. https://doi.org/10.1038/s41586-020-2649-2
[29] Winston Haynes. 2013. Wilcoxon Rank Sum Test. Springer New York, New York, NY, 2354–2355. https://doi.org/10.1007/978-1-4419-9863-7_1185
[30] Jerry L. Hintze and Ray D. Nelson. 1998. Violin Plots: A Box Plot-Density Trace Synergism. The American Statistician 52, 2 (1998), 181–184.
https://doi.org/10.1080/00031305.1998.10480559 arXiv:https://www.tandfonline.com/doi/pdf/10.1080/00031305.1998.10480559
[31] Mehdi Hosseinzadeh Aghdam. 2022. A novel constrained non-negative matrix factorization method based on users and items pairwise relationship
for recommender systems. Expert Systems with Applications 195 (2022), 116593. https://doi.org/10.1016/j.eswa.2022.116593
[32] Kexin Huang, Ying Jin, Emmanuel Candes, and Jure Leskovec. 2023. Uncertainty Quantification over Graph with Conformalized Graph Neural
Networks. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.),
Vol. 36. Curran Associates, Inc., 26699–26721. https://proceedings.neurips.cc/paper_files/paper/2023/file/54a1495b06c4ee2f07184afb9a37abda-Paper-
Conference.pdf
[33] Nicolas Hug. 2020. Surprise: A Python library for recommender systems. Journal of Open Source Software 5, 52 (2020), 2174. https://doi.org/10.
21105/joss.02174
[34] S.M. Ashiqul Islam, Marcos Diaz-Gay, Yang Wu, Mark Barnes, Raviteja Vangara, Erik N. Bergstrom, Yudou He, Mike Vella, Jingwei Wang, Jon W.
Teague, Peter Clapham, Sarah Moody, Sergey Senkin, Yun Rose Li, Laura Riva, Tongwu Zhang, Andreas J. Gruber, Christopher D. Steele, Burcak
Otlu, Azhar Khandekar, Ammal Abbasi, Laura Humphreys, Natalia Syulyukina, Samuel W. Brady, Boian S. Alexandrov, Nischalan Pillay, Jinghui
Zhang, David J. Adams, Inigo Martincorena, David C. Wedge, Maria Teresa Landi, Paul Brennan, Michael R. Stratton, Steven G. Rozen, and Ludmil B.
Alexandrov. 2022. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genomics 2, 11 (2022), 100179.
https://doi.org/10.1016/j.xgen.2022.100179


[35] Yu Ito, Shin-ichi Oeda, and Kenji Yamanishi. [n. d.]. Rank Selection for Non-negative Matrix Factorization with Normalized Maximum Likelihood
Coding. 720–728. https://doi.org/10.1137/1.9781611974348.81 arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611974348.81
[36] Xin Jin and Jiawei Han. 2010. K-Means Clustering. Springer US, Boston, MA, 563–564. https://doi.org/10.1007/978-0-387-30164-8_425
[37] Christopher C. Johnson. 2014. Logistic Matrix Factorization for Implicit Feedback Data. https://api.semanticscholar.org/CorpusID:1451516
[38] Yong-Deok Kim and Seungjin Choi. 2009. Weighted nonnegative matrix factorization. In 2009 IEEE International Conference on Acoustics, Speech and
Signal Processing. 1541–1544. https://doi.org/10.1109/ICASSP.2009.4959890
[39] Yong-Deok Kim and Seungjin Choi. 2009. Weighted nonnegative matrix factorization. In 2009 IEEE International Conference on Acoustics, Speech and
Signal Processing. 1541–1544. https://doi.org/10.1109/ICASSP.2009.4959890
[40] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick,
Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. 2016. Jupyter Notebooks
- a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas,
Fernando Loizides and Birgit Scmidt (Eds.). IOS Press, Netherlands, 87–90. https://eprints.soton.ac.uk/403913/
[41] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
https://doi.org/10.1109/MC.2009.263
[42] Donghyuk Lee, Difei Wang, Xiaohong R Yang, Jianxin Shi, Maria Teresa Landi, and Bin Zhu. 2022. SUITOR: selecting the number of mutational
signatures through cross-validation. PLoS Computational Biology 18, 4 (2022), e1009309.
[43] Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999), 788–791.
https://api.semanticscholar.org/CorpusID:4428232
[44] Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (1999), 788–791.
[45] Stan Z. Li and Anil Jain (Eds.). 2009. Hamming Distance. Springer US, Boston, MA, 668–668. https://doi.org/10.1007/978-0-387-73003-5_956
[46] Qinghua Liu, Andrew Henry Reiner, Arnoldo Frigessi, and Ida Scheel. 2019. Diverse personalized recommendations with uncertainty from implicit
preference data with the Bayesian Mallows model. Knowledge-Based Systems 186 (2019), 104960. https://doi.org/10.1016/j.knosys.2019.104960
[47] Yong Liu, Min Wu, Chunyan Miao, Peilin Zhao, and Xiao-Li Li. 2016. Neighborhood Regularized Logistic Matrix Factorization for Drug-Target
Interaction Prediction. PLOS Computational Biology 12, 2 (02 2016), 1–26. https://doi.org/10.1371/journal.pcbi.1004760
[48] Xindi Ma, Jie Gao, Xiaoyu Liu, Taiping Zhang, and Yuanyan Tang. 2021. Probabilistic Non-Negative Matrix Factorization with Binary Components.
Mathematics 9, 11 (2021). https://doi.org/10.3390/math9111189
[49] David JC MacKay. 1994. Bayesian nonlinear modeling for the prediction competition. ASHRAE transactions 100, 2 (1994), 1053–1062.
[50] Xueyu Mao, Purnamrita Sarkar, and Deepayan Chakrabarti. 2017. On mixed memberships and symmetric nonnegative matrix factorizations. In
Proceedings of the 34th International Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML’17). JMLR.org, 2324‚Äì2333.
[51] Pauli Miettinen and Stefan Neumann. 2020. Recent developments in boolean matrix factorization. arXiv preprint arXiv:2012.03127 (2020).
[52] Morten Mørup and Lars Kai Hansen. 2009. Tuning pruning in sparse non-negative matrix factorization. In 2009 17th European Signal Processing
Conference. IEEE, 1923–1927.
[53] Laura Muzzarelli, Susanne Weis, Simon B. Eickhoff, and Kaustubh R. Patil. 2019. Rank Selection in Non-negative Matrix Factorization: systematic
comparison and a new MAD metric. In 2019 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN.2019.
8852146
[54] Benjamin T Nebgen, Raviteja Vangara, Miguel A Hombrados-Herrera, Svetlana Kuksova, and Boian S Alexandrov. 2021. A neural network for
determination of latent dimensionality in non-negative matrix factorization. Machine Learning: Science and Technology 2, 2 (2021), 025012.
[55] Elena Nenova, Dmitry I Ignatov, and Andrey V Konstantinov. 2013. An FCA-based boolean matrix factorisation for collaborative filtering. arXiv
preprint arXiv:1310.4366 (2013).
[56] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs.
Proc. IEEE 104, 1 (2016), 11–33. https://doi.org/10.1109/JPROC.2015.2483592
[57] Ryosuke Okuta, Yuya Unno, Daisuke Nishino, Shohei Hido, and Crissman Loomis. 2017. CuPy: A NumPy-Compatible Library for NVIDIA GPU
Calculations. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information
Processing Systems (NIPS). http://learningsys.org/nips17/assets/papers/paper_16.pdf
[58] Nobuyuki Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, 1 (1979),
62–66. https://doi.org/10.1109/TSMC.1979.4310076
[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, Oct (2011), 2825–2830.
[60] Fen Pei, Qingya Shi, Haotian Zhang, and Ivet Bahar. 2021. Predicting Protein-Protein Interactions Using Symmetric Logistic Ma-
trix Factorization. Journal of Chemical Information and Modeling 61, 4 (2021), 1670–1682. https://doi.org/10.1021/acs.jcim.1c00173
arXiv:https://doi.org/10.1021/acs.jcim.1c00173 PMID: 33831302.
[61] Lihua Peng, Yue Zhao, and Xiaolin Zhang. 2023. Matrix factorization for missing link prediction with negative sample selection. PLOS ONE 18, 7
(2023), e0289568. https://doi.org/10.1371/journal.pone.0289568
[62] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (1987),
53–65. https://doi.org/10.1016/0377-0427(87)90125-7
[63] Sphinx Development Team. 2024. Sphinx 7.3.7+ Documentation. https://www.sphinx-doc.org/en/master/
Manuscript submitted to ACM
Matrix Factorization for Inferring Associations and Missing Links 33

[64] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva,
John H Morris, Peer Bork, Lars J Jensen, and Christian von Mering. 2018. STRING v11: protein–protein association networks with increased
coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research 47, D1 (11 2018), D607–D613. https:
//doi.org/10.1093/nar/gky1131 arXiv:https://academic.oup.com/nar/article-pdf/47/D1/D607/27437323/gky1131.pdf
[65] Vincent YF Tan and Cédric Févotte. 2012. Automatic relevance determination in nonnegative matrix factorization with the/spl beta/-divergence.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 7 (2012), 1592–1605.
[66] Minghu Tang. 2022. A Joint Weighted Nonnegative Matrix Factorization Model via Fusing Attribute Information for Link Prediction. Springer
Nature, 190–205.
[67] Minghu Tang, Wei Yu, Xiaoming Li, Xue Chen, Wenjun Wang, and Zhen Liu. 2022. Cold-Start Link Prediction via Weighted Symmetric Nonnegative
Matrix Factorization with Graph Regularization. Computer Systems Science and Engineering 43, 3 (2022), 1069–1084. https://doi.org/10.32604/csse.
2022.028841
[68] Duc P. Truong, Erik Skau, Derek Desantis, and Boian Alexandrov. 2021. Boolean Matrix Factorization via Nonnegative Auxiliary Optimization. IEEE
Access 9 (2021), 117169–117177. https://doi.org/10.1109/ACCESS.2021.3107189
[69] Raviteja Vangara, Manish Bhattarai, Erik Skau, Gopinath Chennupati, Hristo Djidjev, Thomas Tierney, James P Smith, Valentin G Stanev, and
Boian S Alexandrov. 2021. Finding the Number of Latent Topics with Semantic Non-negative Matrix Factorization. IEEE Access (2021).
[70] Raviteja Vangara, Erik Skau, Gopinath Chennupati, Hristo Djidjev, Thomas Tierney, James P Smith, Manish Bhattarai, Valentin G Stanev, and
Boian S Alexandrov. 2020. Semantic nonnegative matrix factorization with automatic model determination for topic modeling. In 2020 19th IEEE
International Conference on Machine Learning and Applications (ICMLA). IEEE, 328–335.
[71] Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. 2019. Explainable reasoning over knowledge graphs for
recommendation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5329–5336.
[72] Yangyang Xu, Wotao Yin, Zaiwen Wen, and Yin Zhang. 2012. An alternating direction algorithm for matrix completion with nonnegative factors.
Frontiers of Mathematics in China 7, 2 (2012), 365–384.
[73] Zhong Xu, Wen-Xu Du, and Zhong-Qian Liu. 2016. Link prediction via matrix factorization: A perturbation-based approach. Scientific Reports 6
(2016), 38938. https://doi.org/10.1038/srep38938
[74] Yabing Yao, Yaling He, Zhenyu Huang, et al. 2024. Deep non-negative matrix factorization with edge generator for link prediction in complex
networks. Applied Intelligence 54 (2024), 592–613. https://doi.org/10.1007/s10489-023-05211-1
[75] Xu-Yao Zhang, Guo-Sen Xie, Xiuli Li, Tao Mei, and Cheng-Lin Liu. 2023. A Survey on Learning to Reject. Proc. IEEE 111, 2 (2023), 185–215.
https://doi.org/10.1109/JPROC.2023.3238024
[76] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. 2018. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics
34, 13 (2018), i457–i466. https://doi.org/10.1093/bioinformatics/bty294

A Appendix
In this Appendix, we expand on the results presented in Figures 4, 5, 6, and 11, of which only a subset was shown in the main text for simplicity and clarity. Figures 13 and 14 provide the remaining details so that the results can be viewed comprehensively.

A.1 Dog Data Results (Expanded Version)


Our results on the Dog Data are presented in Figure 13, which provides a more detailed view of method performance under both Boolean and non-Boolean settings. The figure expands upon the results shown in Figures 4, 5, and 6 by including all methods together and by also reporting RMSE, RMSE over non-abstained predictions (RMSE abstained), and the fraction of abstained predictions.
The first row presents violin plots of the predicted ranks 𝑘 across test-set sizes, illustrating how each method estimates the correct rank under varying sparsity levels. The second row reports test-set RMSE, highlighting link-prediction accuracy. The third row examines RMSE after UQ, considering only non-abstained predictions. The fourth row shows the fraction of abstained predictions, where higher values indicate a lower coverage rate. The trends in Figure 13 align with those in the main text. BNMFk, using the Boolean thresholding techniques from Section 3.10 (denoted by subscripts), predicts 𝑘 = 4 more frequently than NMFk and WNMFk, especially at lower sparsity. BNMFk's rank predictions shift downward as the test-set size grows, reflecting adaptation to increased sparsity. The non-Boolean methods (WNMFk, NMFk, and BNMFk with the uniform subscript) often predict higher ranks, capturing different structural properties.
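To make the Boolean thresholding step concrete, the sketch below illustrates one plausible way to booleanize a nonnegative factor matrix with the kmeans and otsu strategies referenced above. This is a minimal illustration under our own assumptions (NumPy and scikit-learn as helpers, a single per-matrix threshold), not the exact implementation behind Figure 13.

```python
# Illustrative sketch (not the implementation used in our experiments):
# booleanize a nonnegative factor matrix W via per-matrix thresholding.
import numpy as np
from sklearn.cluster import KMeans

def threshold_kmeans(W):
    """Split W's entries into two clusters; entries in the
    higher-mean cluster become True (the 'kmeans' subscript)."""
    flat = W.reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flat)
    high = np.argmax([flat[labels == c].mean() for c in (0, 1)])
    return (labels == high).reshape(W.shape)

def threshold_otsu(W, bins=256):
    """Otsu's method: choose the cutoff that maximizes between-class
    variance of the histogram of W's entries (the 'otsu' subscript)."""
    hist, edges = np.histogram(W.ravel(), bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)             # cumulative class-0 probability mass
    mu = np.cumsum(p * centers)   # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * w0 - mu) ** 2 / (w0 * (1 - w0))
    t = centers[np.nanargmax(sigma_b)]
    return W > t

W = np.abs(np.random.default_rng(0).normal(size=(50, 4)))
W_bool = threshold_otsu(W)  # Boolean factor, e.g., for BNMFk-style scoring
```

A coordinate-descent search variant (the search subscript) would instead sweep candidate thresholds directly and keep the one that minimizes reconstruction error against the Boolean target.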
[Figure 13: a 4 × 12 grid of panels. Columns: BNMFk (kmeans, otsu, search, uniform), NMFk (plain, kmeans, otsu, search), and WNMFk (plain, kmeans, otsu, search). Rows: K Prediction (violin plots), RMSE, RMSE (abstained), and Fraction Abstained. The x-axis is the rank 𝑘 (up to 12); hue encodes test-set size (0.1–0.9); a dashed vertical line marks K True = 4.]

Fig. 13. Results for the Dog Data across all methods, including Boolean and non-Boolean variants (columns). Boolean thresholding techniques are denoted by the subscripts kmeans, otsu, and search (coordinate descent); the absence of a subscript (e.g., WNMFk) indicates that no Boolean thresholding was applied to that method. For BNMFk, the subscript uniform denotes running BNMFk without factor Boolean thresholding. The first row presents violin plots of the rank-𝑘 predictions at different sparsity levels (test-set sizes). The second row displays RMSE scores for missing-link prediction on the test set. The third row shows RMSE scores computed only over non-abstained predictions when applying UQ. The fourth row illustrates the fraction of abstained predictions, i.e., the proportion of cases where the model chose to abstain ("I do not know") rather than make a prediction. Results are reported for each rank 𝑘 (the x-axis), and the dark dashed vertical line in each column marks the true rank 𝑘 = 4.

RMSE results confirm that Boolean thresholding improves link prediction, particularly for NMFk and WNMFk. Performance declines with larger test sets, as indicated by the color shift from green to red, but the Boolean methods maintain the lowest RMSE near the correct rank. The change in RMSE under abstention (third row) shows that UQ reduces errors by withholding uncertain predictions, yielding lower RMSE scores than in the second row. The fraction of abstained predictions (fourth row) increases with test-set size, yet confidence improves near the correct rank for several of the methods. This figure adds granularity beyond the main text, reinforcing our conclusions on rank-prediction accuracy, RMSE trends, and the role of UQ in improving predictions.
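To make the quantities in the third and fourth rows precise, the following sketch computes RMSE over non-abstained predictions and the fraction of abstained predictions from per-entry uncertainties. The threshold tau and the array names are hypothetical placeholders for exposition, not values from our experiments.

```python
# Hypothetical sketch of the two UQ-related quantities in Figure 13:
# RMSE over non-abstained predictions and the abstention rate.
import numpy as np

def abstained_metrics(y_true, y_pred, uncertainty, tau):
    """Abstain ("I do not know") wherever uncertainty exceeds tau;
    score RMSE only on the remaining, confident predictions."""
    keep = uncertainty <= tau
    frac_abstained = 1.0 - keep.mean()   # fourth row of Figure 13
    if keep.any():
        rmse = np.sqrt(np.mean((y_true[keep] - y_pred[keep]) ** 2))
    else:
        rmse = np.nan                    # every prediction abstained
    return rmse, frac_abstained

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000).astype(float)
y_pred = np.clip(y_true + rng.normal(0, 0.3, size=1000), 0, 1)
unc = np.abs(rng.normal(0, 0.2, size=1000))
print(abstained_metrics(y_true, y_pred, unc, tau=0.25))
```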

A.2 Uncertainty-Error Correlation Results (Expanded Version)


Figure 14 presents a more detailed analysis of the correlation between UQ and prediction errors on the Gaussian Data. It extends the results shown in Figure 11 by covering a broader range of test-set sizes (0.1 to 0.9, in increments of 0.1) and true ranks (𝑘 from 2 to 6). Pearson correlation coefficients and confidence intervals (CIs) are reported across different sparsity levels (columns) and true matrix ranks (rows).

[Figure 14: a 5 × 9 grid of panels. Columns: test-set sizes 0.1–0.9. Rows: K True = 2 to 6. The y-axis is the Pearson Correlation Coefficient (0.0–1.0); the x-axis compares the methods WNMFk and RNMFk; colored bars correspond to decomposition ranks 𝑘 = 1 to 15.]

Fig. 14. Pearson correlation coefficients with CIs between UQ and errors, reported across test-set sizes (columns) and true ranks 𝑘 (rows). The x-axis differentiates WNMFk and RNMFk by hue, and the y-axis represents correlation values. Colored bars indicate results for different rank-𝑘 decompositions.

The x-axis differentiates WNMFk and RNMFk by hue, while the y-axis represents Pearson correlation values, with each colored bar corresponding to a specific decomposition rank 𝑘. Here, WNMFk and RNMFk do not use Boolean thresholding. The observed trends align with those in the main text across a broader range of test-set sizes and true ranks. The correlation between uncertainty and error is generally high, as the models assign greater uncertainty to locations with larger prediction errors. Correlation peaks at the true rank, especially at lower test-set sizes with WNMFk, but declines at extreme sparsity levels (test-set sizes above 0.7 or 0.8). This supports the hypothesis that uncertainty saturates at high values across all test points, reducing variability and weakening its correlation with actual errors. The figure provides a more comprehensive view of the interaction between UQ and error, complementing the main-text results.
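As an illustration of how such a correlation analysis can be reproduced, the sketch below computes the Pearson correlation between per-entry uncertainty and absolute error, with an approximate confidence interval from the standard Fisher z-transform. The CI construction is a textbook choice assumed here for exposition and may differ from the procedure used to generate Figure 14.

```python
# Illustrative sketch (Fisher-z CI assumed for exposition; the exact CI
# procedure behind Figure 14 may differ): UQ-vs-error correlation.
import numpy as np
from scipy import stats

def uq_error_correlation(uncertainty, errors, alpha=0.05):
    """Pearson r between per-entry uncertainty and |error|,
    with an approximate (1 - alpha) Fisher-z confidence interval."""
    r, _ = stats.pearsonr(uncertainty, np.abs(errors))
    n = len(errors)
    z = np.arctanh(r)                # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)        # standard error of z
    zc = stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - zc * se), np.tanh(z + zc * se)
    return r, (lo, hi)

rng = np.random.default_rng(2)
errors = rng.normal(0, 1, size=500)
unc = np.abs(errors) + rng.normal(0, 0.5, size=500)  # correlated by design
print(uq_error_correlation(unc, errors))
```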
